Abstract

Hardness results for maximum agreement problems have 
close connections to hardness results for proper learning 
in computational learning theory. In this paper we prove 
two hardness results for the problem of finding a low 
degree polynomial threshold function (PTF) which has 
the maximum possible agreement with a given set of 
labeled examples in $\mathbb{R}^n \times \{-1, 1\}$. We prove that for any
constants $d \geq 1$, $\epsilon > 0$,

• Assuming the Unique Games Conjecture, no
polynomial-time algorithm can find a degree-$d$ PTF
that is consistent with a $(1/2 + \epsilon)$ fraction of a given
set of labeled examples in $\mathbb{R}^n \times \{-1, 1\}$, even if there
exists a degree-$d$ PTF that is consistent with a $1 - \epsilon$
fraction of the examples.

• It is NP-hard to find a degree-2 PTF that is consistent
with a $(1/2 + \epsilon)$ fraction of a given set of labeled
examples in $\mathbb{R}^n \times \{-1, 1\}$, even if there exists a
halfspace (degree-1 PTF) that is consistent with a $1 - \epsilon$
fraction of the examples.

These results immediately imply the following hardness
of learning results: (i) Assuming the Unique Games
Conjecture, there is no better-than-trivial proper learning
algorithm that agnostically learns degree-$d$ PTFs under
arbitrary distributions; (ii) There is no better-than-trivial
learning algorithm that outputs degree-2 PTFs and
agnostically learns halfspaces (i.e. degree-1 PTFs) under
arbitrary distributions.
1 Introduction

A polynomial threshold function (PTF) of degree
$d$ is a function $f : \mathbb{R}^n \to \{-1, +1\}$ of the form
$f(x) = \mathrm{sign}(p(x))$, where

$$p(x) = \sum_{\text{multiset } S \subseteq [n],\ |S| \leq d} c_S \prod_{i \in S} x_i$$

is a degree-$d$ multivariate polynomial with real coefficients.
Degree-1 PTFs are commonly known as halfspaces
or linear threshold functions, and have been
intensively studied for decades in fields as diverse 
as theoretical neuroscience, social choice theory and 
Boolean circuit complexity. 

The last few years have witnessed a surge of re- 
search interest and results in theoretical computer 
science on halfspaces and low-degree PTFs, see e.g. 
[251 EH E EJl E] . One reason for this interest is 
the central role played by low-degree PTFs (and half- 
spaces in particular) in both practical and theoretical 
aspects of machine learning, where many learning al- 
gorithms either implicitly or explicitly use low-degree 
PTFs as their hypotheses. More specifically, several 
widely used linear separator learning algorithms such 
as the Perceptron algorithm and the "maximum mar- 
gin" algorithm at the heart of Support Vector Ma- 
chines output halfspaces as their hypotheses. These 
and other halfspace-based learning methods are com- 
monly augmented in practice with the "kernel trick," 
which makes it possible to efficiently run these al- 
gorithms over an expanded feature space and thus 
potentially learn from labeled data that is not lin- 
early separable in R™. The "polynomial kernel" is a 
popular kernel to use in this way; when, as is usu- 
ally the case, the degree parameter in the polynomial 
kernel is set to be a small constant, these algorithms 
output hypotheses that are equivalent to low-degree 
PTFs. Low-degree PTFs are also used as hypotheses
in several important learning algorithms with a
more complexity-theoretic flavor, such as the low-
degree algorithm of Linial et al. [21] and its variants
[L2"l [22] , including some algorithms for distribution-
specific agnostic learning [UJ [201 ED E].

Given the importance of learning algorithms that 
construct low-degree PTF hypotheses, it is a natural 
goal to study the limitations of learning algorithms 
that work in this way. On the positive side, it is well 
known that if there is a PTF (of constant degree d) 
that is consistent with all the examples in a data 
set, then a consistent hypothesis can be found in 
polynomial time simply by using linear programming 
(with the $\Theta(n^d)$ monomials of degree at most $d$ as the
variables in the LP). However, the assumption that 
some low-degree PTF correctly labels all examples 
seems quite strong; in practice data is often noisy or 
too complex to be consistent with a simple concept. 
Thus we are led to ask: if no low-degree PTF classifies 
an entire data set perfectly, to what extent can the 
data be learned using low-degree PTF hypotheses?
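The realizable case described above can be illustrated with a small sketch. The text uses linear programming over the monomial coefficients; the code below swaps in the Perceptron update over the same expanded monomial features, purely to stay within the Python standard library. It is not the LP method, and it only terminates when a consistent PTF with positive margin exists. The function names and the `max_epochs` cutoff are assumptions made for this illustration.

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_features(x, d):
    """Values of all monomials of degree <= d at x, one per multiset S with |S| <= d."""
    n = len(x)
    return [prod(x[i] for i in S)
            for size in range(d + 1)
            for S in combinations_with_replacement(range(n), size)]

def fit_consistent_ptf(examples, d, max_epochs=1000):
    """Perceptron over monomial features (stand-in for the LP in the text).

    Returns a coefficient vector consistent with every example, or None if
    no mistake-free pass occurred within max_epochs (hypothetical cutoff).
    """
    feats = [(monomial_features(x, d), y) for x, y in examples]
    w = [0.0] * len(feats[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, y in feats:
            if y * sum(wi * pi for wi, pi in zip(w, phi)) <= 0:
                w = [wi + y * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            return w
    return None

def ptf_predict(w, x, d):
    """Evaluate sign(p(x)) for the polynomial with coefficient vector w."""
    value = sum(wi * pi for wi, pi in zip(w, monomial_features(x, d)))
    return 1 if value > 0 else -1

# XOR-style data: no halfspace is consistent, but the degree-2 PTF sign(x1 * x2) is.
xor = [((1.0, 1.0), 1), ((-1.0, -1.0), 1), ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]
w = fit_consistent_ptf(xor, 2)
```

On this data a degree-2 hypothesis is found, while the degree-1 search never completes a mistake-free pass, which matches the motivation for moving from halfspaces to low-degree PTFs.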

In this paper, we address this question under the 
agnostic learning framework [TTJ [Hj . Roughly speak- 
ing, a function class C is agnostically learnable if we 
can efficiently find a hypothesis that has accuracy ar- 
bitrarily close to the accuracy of the best hypothesis 
in C. Uniform convergence results imply that 
learnability in this model is essentially equivalent to 
the ability to come up with a hypothesis that cor- 
rectly classifies almost as many examples as the op- 
timal hypothesis in the function class. This problem 
is sometimes referred to as a "Maximum Agreement" 
problem for C. As we now describe, this problem has 
previously been well studied for the class C of halfs- 
paces. 

Related Work. The Maximum Agreement problem
for halfspaces over $\mathbb{R}^n$ was shown to be NP-hard
to approximate within some constant factor in
[TJ [2J. The inapproximability factor was improved
to $84/85 + \epsilon$ in [3], which showed that this hardness
result applies even if the examples must lie on the
$n$-dimensional Boolean hypercube. Finally, a tight
inapproximability result was established independently
in [10] and [7]; these works showed that for any constant
$\epsilon > 0$, it is NP-hard to find a halfspace consistent
with $(1/2 + \epsilon)$ of the examples even if there exists
a halfspace consistent with $(1 - \epsilon)$ of the examples.
(It is trivial to find a halfspace consistent with half of
the examples since either the constant-0 or constant-1
halfspace will suffice.) The reduction in [7] produced
examples with real-valued coordinates, whereas the
proof in [10] yielded examples that lie on the Boolean
hypercube.

Thanks to these results the Maximum Agreement 



problem is well-understood for halfspaces, but the 
situation is very different for low-degree PTFs. Even 
for degree-2 PTFs no hardness results were previously 
known, and recent work [6] has in fact given efficient 
agnostic learning algorithms for low-degree PTFs 
under specific distributions on examples such as 
Gaussian distributions or the uniform distribution 
over $\{-1, 1\}^n$ (though it should be noted that these
distribution-specific agnostic learning algorithms for
degree-$d$ PTFs are not proper: they output PTF
hypotheses of degree $\gg d$). In this paper we make
the first progress on this problem, by establishing 
strong hardness of approximation results for the 
Maximum Agreement problem for low-degree PTFs. 
Our results directly imply corresponding hardness 
results for agnostically learning low degree PTFs 
under arbitrary distributions; we present all these 
results below. 

Main Results. Our main results are the follow- 
ing two theorems. The first result establishes UGC- 
hardness of finding a nontrivial degree-d PTF hypoth- 
esis even if some degree-d PTF has almost perfect 
accuracy: 

Theorem 1.1. Fix $\epsilon > 0$, $d \geq 1$. Assuming the
Unique Games Conjecture, no polynomial-time algorithm
can find a degree-$d$ PTF that is consistent with a
$(1/2 + \epsilon)$ fraction of a given set of labeled examples in
$\mathbb{R}^n \times \{-1, 1\}$, even if there exists a degree-$d$ PTF that
is consistent with a $1 - \epsilon$ fraction of the examples.

The second result shows that it is NP-hard to 
find a degree-2 PTF hypothesis that has nontrivial 
accuracy even if some halfspace has almost perfect 
accuracy: 

Theorem 1.2. Fix $\epsilon > 0$. It is NP-hard to find a
degree-2 PTF that is consistent with a $(1/2 + \epsilon)$ fraction
of a given set of labeled examples in $\mathbb{R}^n \times \{-1, 1\}$,
even if there exists a halfspace (degree-1 PTF) that is
consistent with a $1 - \epsilon$ fraction of the examples.

As noted above, both problems become easy 
(using linear programming) if the best hypothesis is 
assumed to have perfect agreement with the data set 
rather than agreement 1 — e, and it is trivial to find 
a (constant-valued) hypothesis with agreement rate 
1/2 for any data set. Thus the parameters in both 
hardness results are essentially the best possible. 

These results can be rephrased as hardness of
agnostic learning results in the following way: (i)
Assuming the Unique Games Conjecture, even if
there exists a degree-$d$ PTF that is consistent with a
$1 - \epsilon$ fraction of the examples, there is no efficient
proper agnostic learning algorithm that can output
a degree-$d$ PTF correctly labeling more than a $1/2 + \epsilon$
fraction of the examples; (ii) Assuming P $\neq$ NP,
even if there exists a halfspace that is consistent with a
$1 - \epsilon$ fraction of the examples, there is no efficient
agnostic learning algorithm that can find a degree-2
PTF correctly labeling more than a $1/2 + \epsilon$ fraction of
the examples.

Organization. In Section 2 we present the
complexity-theoretic basis (the Unique Games Conjecture
and the NP-hardness of Label Cover) of our
hardness results. In Section 3 we sketch a new proof
of the hardness of the Maximum Agreement problem
for halfspaces, and give an overview of how the proofs
of Theorems 1.1 and 1.2 build on this basic argument.
In Sections 4 and 5 we prove Theorems 1.1 and 1.2.

Notational Preliminaries: For $n \in \mathbb{Z}^+$ we denote
by $[n]$ the set $\{1, \ldots, n\}$. For $i, j \in \mathbb{Z}^+$, $i \leq j$, we
denote by $[i, j]$ the set $\{i, i+1, \ldots, j\}$. We write
$\{j : m\}$ to denote the multi-set that contains $m$ copies
of the element $j$. We write $\chi_S(x)$ to denote $\prod_{i \in S} x_i$,
the monomial corresponding to the multiset $S$.

2 Complexity-theoretic preliminaries 

We recall the Unique Games problem that was intro- 
duced by Khot [17]: 

Definition 2.1. A Unique Games instance $\mathcal{L}$ is
defined by a tuple $(U, V, E, k, \Pi)$. Here $U$ and $V$
are the two vertex sets of a regular bipartite graph
and $E$ is the set of edges between $U$ and $V$. $\Pi$ is a
collection of bijections, one for each edge: $\Pi = \{\pi_e :
[k] \to [k]\}_{e \in E}$ where each $\pi_e$ is a bijection on $[k]$.
A labeling $\ell$ is a function that maps $U \to [k]$ and
$V \to [k]$. We say that an edge $e = (u, v)$ is satisfied
by labeling $\ell$ if $\pi_e(\ell(v)) = \ell(u)$. We define the value
of the Unique Games instance $\mathcal{L}$, denoted $\mathrm{Opt}(\mathcal{L})$, to
be the maximum fraction of edges that can be satisfied
by any labeling.
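To make the definition of the value concrete, the following sketch computes it by exhaustive search over all labelings of a tiny instance. It is exponential in the number of vertices and meant only as an illustration. The data layout (0-based labels, a dict from edges to permutation tuples) and the function name are assumptions of this sketch, not notation from the paper.

```python
from itertools import product

def unique_games_value(U, V, edges, k):
    """Brute-force value of a Unique Games instance.

    edges maps (u, v) -> pi, where pi is a tuple that is a permutation of
    range(k); the edge is satisfied by a labeling when pi[label[v]] == label[u].
    """
    vertices = list(U) + list(V)
    best = 0.0
    for labels in product(range(k), repeat=len(vertices)):
        label = dict(zip(vertices, labels))
        satisfied = sum(1 for (u, v), pi in edges.items()
                        if pi[label[v]] == label[u])
        best = max(best, satisfied / len(edges))
    return best

identity, swap = (0, 1), (1, 0)
U, V = ["u1", "u2"], ["v1", "v2"]
# Three identity constraints force all labels equal; the swap edge then fails.
frustrated = {("u1", "v1"): identity, ("u1", "v2"): identity,
              ("u2", "v1"): identity, ("u2", "v2"): swap}
all_identity = {e: identity for e in frustrated}
```

In the `frustrated` instance no labeling satisfies all four edges, so the value is 3/4, whereas the `all_identity` instance has value 1.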

The Unique Games Conjecture (UGC) was proposed
by Khot in [17] and has led to many improved
hardness of approximation results over those which
can be achieved assuming only P $\neq$ NP:

Conjecture 2.2 (Unique Games Conjecture).
Fix any constant $\eta > 0$. For sufficiently large
$k = k(\eta)$, given a Unique Games instance $\mathcal{L} =
(U, V, E, k, \Pi)$ that is guaranteed to satisfy one of the
following two conditions, it is NP-hard to determine
which condition is satisfied: $\mathrm{Opt}(\mathcal{L}) \geq 1 - \eta$, or
$\mathrm{Opt}(\mathcal{L}) \leq \eta$.$^1$

$^1$ We use the statement from [18] which is equivalent to the
original Unique Games Conjecture.



Our first hardness result, Theorem 1.1, is proved
under the Unique Games Conjecture. Our second
hardness result, Theorem 1.2, uses only the assumption
that P $\neq$ NP; the proof employs a reduction
from the Label Cover problem, defined below.

Definition 2.3. A Label Cover instance $\mathcal{L}$ is defined
by a tuple $(U, V, E, k, m, \Pi)$. Here $U$ and $V$ are the
two vertex sets of a regular bipartite graph and $E$ is
the set of edges between $U$ and $V$. $\Pi$ is a collection
of "projections", one for each edge: $\Pi = \{\pi_e : [m] \to
[k]\}_{e \in E}$ and $m, k$ are positive integers. A labeling $\ell$
is a function that maps $U \to [k]$ and $V \to [m]$. We
say that an edge $e = (u, v)$ is satisfied by labeling $\ell$
if $\pi_e(\ell(v)) = \ell(u)$. We define the value of the Label
Cover instance, denoted $\mathrm{Opt}(\mathcal{L})$, to be the maximum
fraction of edges that can be satisfied by any labeling.

We use the following theorem [23] which establishes
NP-hardness of a "gap" version of Label Cover:

Theorem 2.4. Fix any constant $\eta > 0$. Given a Label
Cover instance $\mathcal{L} = (U, V, E, k, m, \Pi)$ that is guaranteed
to satisfy one of the following two conditions,
it is NP-hard to determine which condition is satisfied:
$\mathrm{Opt}(\mathcal{L}) = 1$, or $\mathrm{Opt}(\mathcal{L}) \leq 1/m^{\eta}$.

3 Overview of our arguments 

To illustrate the structure of our arguments, let us 
begin by sketching a proof of the following hardness 
result for the Maximum Agreement problem for half- 
spaces: 

Proposition 3.1. Assuming the Unique Games
Conjecture, no polynomial-time algorithm can find
a halfspace (degree-1 PTF) that is consistent with a
$(1/2 + \epsilon)$ fraction of a given set of labeled examples in
$\mathbb{R}^n \times \{-1, 1\}$, even if there exists a halfspace that is
consistent with a $1 - \epsilon$ fraction of the examples.

As mentioned above, the same hardness result
(based only on the assumption that P $\neq$ NP) has
already been established in [10]; indeed, we do not
claim Proposition 3.1 as a new result. However, the
argument sketched below is different from (and, we
believe, simpler than) the other proofs; it helps to
illustrate how we eventually achieve the more general
hardness results, Theorems 1.1 and 1.2.

Proof Sketch for Proposition 3.1. We describe
a reduction that maps any instance $\mathcal{L}$ of Unique
Games to a set of labeled examples with the following
guarantee: if $\mathrm{Opt}(\mathcal{L})$ is very close to 1 then there
is a halfspace that agrees with a $1 - \epsilon$ fraction of the
examples, while if $\mathrm{Opt}(\mathcal{L})$ is very close to 0 then no
halfspace agrees with more than a $1/2 + \epsilon$ fraction of the
examples. A reduction of this sort directly yields
Proposition 3.1.

Let $\mathcal{L} = (U, V, E, k, \Pi)$ be a Unique Games
instance. Each example generated by the reduction
has $(|U| + |V|)k$ coordinates, i.e. the examples lie
in $\mathbb{R}^{(|U|+|V|)k}$. The coordinates should be viewed as
being grouped together in the following way: there
is a block of $k$ coordinates for each vertex $w$ in
$U \cup V$. We index the coordinates of $x \in \mathbb{R}^{(|U|+|V|)k}$
as $x = (x_w^{(i)})$ where $w \in U \cup V$ and $i \in [k]$.

Given any function $f : \mathbb{R}^{(|U|+|V|)k} \to \{-1, 1\}$
and vertex $w \in U \cup V$, we write $f_w$ to denote the
restriction of $f$ to the $k$ coordinates $(x_w^{(i)})_{i \in [k]}$ that is
obtained by setting all other coordinates $(x_{w'}^{(i)})_{w' \neq w}$
to 0. Similarly, for $e = \{u, v\}$ an edge in $U \times V$, we
write $f_e$ for the restriction that fixes all coordinates
$(x_{w'}^{(i)})_{w' \notin e}$ to 0 and leaves the $2k$ coordinates $x_u^{(i)}, x_v^{(i)}$
unrestricted.

For every labeling $\ell : U \cup V \to [k]$ of the instance,
there is a corresponding halfspace over $\mathbb{R}^{(|U|+|V|)k}$:

$$\mathrm{sign}\Big(\sum_{u \in U} x_u^{(\ell(u))} - \sum_{v \in V} x_v^{(\ell(v))}\Big).$$

Given a Unique Games instance $\mathcal{L}$, the reduction
constructs a distribution $\mathcal{D}$ over labeled examples
such that if $\mathrm{Opt}(\mathcal{L})$ is almost 1 then the above
halfspace has very high accuracy w.r.t. $\mathcal{D}$, and any
halfspace that has accuracy at least $1/2 + \epsilon$ yields a
labeling that satisfies a constant fraction of edges in
$\mathcal{L}$. A draw from $\mathcal{D}$ is obtained by first selecting a
uniform random edge $e = \{u, v\}$ from $E$, and then
making a draw from $\mathcal{D}_e$, where $\mathcal{D}_e$ is a distribution
over labeled examples that we describe below.

Fix an edge $e = (u, v)$. For the sake of exposition,
let us assume the mapping $\pi_e \in \Pi$ associated with $e$
is the identity permutation, i.e. $\pi_e(i) = i$ for every
$i \in [k]$. The distribution $\mathcal{D}_e$ will have the following
properties:

(i) For every $(y, b)$ in the support of $\mathcal{D}_e$, all coordinates
$y_w^{(i)}$ for every vertex $w \notin e$ are zero.

(ii) For every label $i \in [k]$, the halfspace $\mathrm{sign}(x_u^{(i)} - x_v^{(i)})$
has accuracy $1 - \epsilon$ w.r.t. $\mathcal{D}_e$.

(iii) If $\mathrm{sign}(f_e)$ is a halfspace that has accuracy
at least $1/2 + \epsilon$ w.r.t. $\mathcal{D}_e$, then the functions
$f_u, f_v$ can each be individually "decoded" to
a "small" (constant-sized) set $S_u, S_v \subseteq [k]$ of
labels such that $S_u \cap S_v \neq \emptyset$ (so a labeling
that satisfies a nonnegligible fraction of edges in
expectation can be obtained simply by choosing
a random label from $S_w$ for each $w$; such a
random choice will satisfy each edge's bijection
with constant probability, so in expectation will
satisfy a constant fraction of constraints).



Let us explain item (iii) in more detail. Since
the distribution $\mathcal{D}_e$ is supported on vectors $y$ that
have the $(y_w^{(i)})_{w \notin e}$ coordinates all 0, the distribution
$\mathcal{D}_e$ only "looks at" the restriction $f_e$ of $f$, which
is a halfspace on $\mathbb{R}^{2k}$. Thus achieving (iii) can be
viewed as solving a kind of property testing problem
which may loosely be described as "Matching dictator
testing for halfspaces." To be more precise, what
is required is a distribution $\mathcal{D}_e$ over $2k$-dimensional
labeled examples and a "decoding" algorithm $A$
which takes as input a $k$-variable halfspace and
outputs a set of coordinates. Together these must
have the following properties:



• (Completeness) If $f_e(x) = x_u^{(i)} - x_v^{(i)}$ then
$\mathrm{sign}(f_e(y)) = b$ with probability $1 - \epsilon$ for $(y, b) \sim \mathcal{D}_e$.

• (Soundness) If $f_e$ is such that $\mathrm{sign}(f_e(y)) = b$
with probability at least $1/2 + \epsilon$ for $(y, b)$ drawn
from $\mathcal{D}_e$, then the output sets $A(f_u)$, $A(f_v)$ of
the decoding algorithm (when it is run on $f_u$ and
$f_v$ respectively) are two small sets that intersect
each other.



Testing problems of this general form are often re- 
ferred to as Dictatorship Testing; the design and anal- 
ysis of such tests is a recurring theme in hardness of 
approximation. 

We give a "matching dictator test for halfspaces"
below. More precisely, in the following figure we
describe the distribution $\mathcal{D}_e$ over examples (the
decoding algorithm $A$ is described later).



$T_1$: Matching Dictatorship Test for
Halfspaces

Input: A halfspace $f_e : \mathbb{R}^{2k} \to \mathbb{R}$.
Set $\epsilon$ :=,^, $\delta := 1/2^k$.

1. Generate independent 0/1 bits $a_1, a_2, \ldots, a_k$
each with $\mathbf{E}[a_i] = \epsilon$. Generate $2k$ independent
$N(0, 1)$ Gaussian random variables:
$h_1, h_2, \ldots, h_k, g_1, g_2, \ldots, g_k$. Generate a random
bit $b \in \{-1, 1\}$.

2. Set $r = (a_1 h_1 + g_1, \ldots, a_k h_k + g_k, g_1, \ldots, g_k)$
and $\omega = (1, \ldots, 1, 0, \ldots, 0) \in \mathbb{R}^{2k}$ to be the
vector whose first $k$ coordinates are 1 and last
$k$ coordinates are 0.

3. Set $y = r + b\delta\omega$. The result of a draw from
$\mathcal{D}_e$ is the labeled example $(y, b)$.

The test checks whether $\mathrm{sign}(f_e(y))$ equals $b$.
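The three steps of the test can be sketched as a sampler. The value of $\epsilon$ is not legible in the source, so it is left as a free parameter `eps` here; the function names, the returned bit vector `a`, and the convention sign(0) = +1 are assumptions of this sketch.

```python
import random

def sample_matching_dictator_example(k, eps, rng):
    """One draw (y, b) from the edge distribution of the test; also returns the bits a."""
    delta = 2.0 ** (-k)
    a = [1 if rng.random() < eps else 0 for _ in range(k)]
    h = [rng.gauss(0.0, 1.0) for _ in range(k)]
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    b = rng.choice([-1, 1])
    # r = (a_1 h_1 + g_1, ..., a_k h_k + g_k, g_1, ..., g_k)
    r = [a[i] * h[i] + g[i] for i in range(k)] + g
    # omega has ones on the first k coordinates and zeros on the last k
    omega = [1.0] * k + [0.0] * k
    y = [r[j] + b * delta * omega[j] for j in range(2 * k)]
    return y, b, a

def matching_dictator(y, i, k):
    """The linear form x_u^(i) - x_v^(i), with 0-based label i."""
    return y[i] - y[k + i]

def sign(t):
    return 1 if t >= 0 else -1

rng = random.Random(0)
draws = [sample_matching_dictator_example(8, 0.1, rng) for _ in range(2000)]
```

Whenever the bit $a_i$ is 0, the matching dictator evaluates to $b\delta$ and therefore passes, which is the source of the $1 - \epsilon$ completeness.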

It is useful to view the test in the following light:
Let us write $f_e(x)$ as $\theta + \sum_{i=1}^{k} w_u^{(i)} x_u^{(i)} + \sum_{i=1}^{k} w_v^{(i)} x_v^{(i)}$,
and let us suppose that $\sum_{i=1}^{k} |w_u^{(i)}| = 1$ (as long
as some $w_u^{(i)}$ is nonzero this is easily achieved by
rescaling; for this intuitive sketch we ignore the case
that all $w_u^{(i)}$ are 0, which is not difficult to handle).
Then we have $f_e(y) = f_e(r) + b\delta$, and we may view the
test as randomly choosing one of the two inequalities
$f_e(r) - \delta < 0$, $f_e(r) + \delta \geq 0$ and checking that it
holds. Since at least one of these inequalities must
hold for every $f_e$, the probability that $f_e$ passes the
test is $\frac{1}{2} + \frac{1}{2}\Pr_r[f_e(r) \in [-\delta, \delta)]$. This interpretation
will be useful both for analyzing completeness and
soundness of the test.

For completeness, it is easy to see that the
"matching dictator" function $f_e(x) = x_u^{(i)} - x_v^{(i)}$ has
$f_e(r) = a_i h_i$ and thus $\Pr[f_e(r) = 0] = 1 - \epsilon$, so this
function indeed passes the test with probability $1 - \epsilon$.

The soundness analysis, which we now sketch,
is more involved. Let $f$ be such that $\Pr_r[f_e(r) \in
[-\delta, \delta)] \geq 2\epsilon$. Since $f_e(r) = \theta + \sum_i (w_u^{(i)} + w_v^{(i)}) g_i +
\sum_i w_u^{(i)} a_i h_i$ and $g_i, h_i$ are i.i.d. Gaussians, conditioned
on a given outcome of the $a_i$-bits the value $f_e(r)$
follows the Gaussian distribution with mean $\theta$ and
variance $\sum_i (w_u^{(i)} + w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$. Now recall
that an $N(0, \sigma)$ Gaussian random variable lands in
the interval $[-t, t]$ with probability at most $O(t/\sigma)$.
So any $a$-vector for which the variance $\sum_i (w_u^{(i)} +
w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$ is not "tiny" can contribute only
a negligible amount to the overall probability that
$f_e(r)$ lies in $[-\delta, \delta)$ (recall that $\delta$ is extremely tiny).

Since by assumption $\Pr_r[f_e(r) \in [-\delta, \delta)]$ is
non-negligible (at least $2\epsilon$), there must be a non-negligible
fraction of $a$-vector outcomes that make the variance
$\sum_i (w_u^{(i)} + w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$ be "tiny." This implies
that there must be only a "few" coordinates $w_u^{(i)}$
for which $|w_u^{(i)}|$ is not tiny (for if there were many
non-tiny $w_u^{(i)}$ coordinates, then $\sum_i (w_u^{(i)} a_i)^2$ would be
non-tiny with probability nearly 1 over the choice of
the $a$-vector). Moreover, $w_u^{(i)} + w_v^{(i)}$ must be $\approx 0$ for
each $i$, so for each $i$ the magnitudes $|w_u^{(i)}|$ and $|w_v^{(i)}|$
must be nearly equal; and in particular, each $|w_u^{(i)}|$
is large if and only if $|w_v^{(i)}|$ is large. Finally, since
$\sum_i |w_u^{(i)}|$ equals 1, some $w_u^{(i)}$'s must be large (at least
$1/k$).

With these facts in place, the appropriate decoding
algorithm $A$ is rather obvious: given $f_u =
\theta + \sum_{i=1}^{k} w_u^{(i)} x_u^{(i)}$ as input, $A$ outputs the set $S_u$ of
those coordinates $i$ for which $|w_u^{(i)}|$ is large (and
similarly for $f_v$). This set cannot be too large since
$\sum_{i=1}^{k} |w_u^{(i)}|$ equals 1. Now a labeling that satisfies
edge $e$ with non-negligible probability can be obtained
by outputting a random element from $S_u$ and
a random element from $S_v$; since these sets are small
there is a non-negligible probability that the labels
will match as required. This concludes the proof
sketch of Proposition 3.1. □
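The decoding step lends itself to a short sketch. The paper only says "large"; the explicit `threshold` parameter, the normalization by the total absolute weight, and both function names are assumptions introduced for illustration.

```python
def decode_labels(weights, threshold):
    """Return the coordinates whose normalized absolute weight is at least threshold.

    Because the normalized weights sum to 1, the returned set has at most
    1 / threshold elements.
    """
    total = sum(abs(w) for w in weights)
    if total == 0:
        return set()
    return {i for i, w in enumerate(weights) if abs(w) / total >= threshold}

def match_probability(S_u, S_v):
    """Chance that independent uniform picks from S_u and S_v coincide."""
    if not S_u or not S_v:
        return 0.0
    return len(S_u & S_v) / (len(S_u) * len(S_v))

# Nearly "matching" weight vectors: w_u^(i) + w_v^(i) is close to 0 for each i.
w_u = [0.5, -0.02, 0.48, 0.0]
w_v = [-0.49, 0.01, -0.5, 0.0]
S_u = decode_labels(w_u, 0.25)
S_v = decode_labels(w_v, 0.25)
```

Here both decoded sets are `{0, 2}`, so independent random picks agree with probability 1/2.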

Overview of the proofs of Theorems 1.1
and 1.2. For Theorem 1.1 (hardness of properly
learning degree-$d$ PTFs), we must deal with the
additional complication of handling the cross-terms
such as $x_u^{(i)} x_v^{(j)}$ between $u$-variables and $v$-variables
that may be present in degree-$d$ PTFs. As an example
of how such cross-terms can cause problems,
observe that the degree-3 polynomial $f_e = (x_u^{(1)} -
x_v^{(1)})(x_u^{(2)})^2$ would pass the test $T_1$ with high probability,
but this polynomial has $f_v = 0$ so there is
no way to successfully "decode" a good label for $v$.
To get around this, we modify the test $T_1$ to set
$y = (a_1 h_1 + g_1^d + b\delta, a_2 h_2 + g_2^d + b\delta, \ldots, a_k h_k + g_k^d +
b\delta, g_1, \ldots, g_k)$; intuitively this modified test checks
whether the polynomial $f_e$ is of the form $x_u^{(i)} - (x_v^{(i)})^d$.
The bulk of our work is in analyzing the soundness of
this test; we show that any polynomial $f_e$ that passes
the modified test with probability significantly better
than 1/2 must have almost no coefficient weight on
cross-terms, and that in fact the restricted polynomials
$f_u, f_v$ can each be decoded to a small set in such a
way that there is a matching pair as desired. We give
a complete description and analysis of our Dictator
Test and prove Theorem 1.1 in Section 4.

For Theorem 1.2, a first observation is that the
test $T_1$ in fact already has soundness $3/4 + \epsilon$ for
degree-2 PTFs. To see this, we begin by writing
the degree-2 polynomial $f_e(x)$ as $\theta + f_1(x) + f_2(x)$
where $f_1(x)$ is the linear (degree 1) part and $f_2(x)$
is the quadratic (degree 2) part (note that $f_1$ is
an odd function and $f_2$ is an even function). We
next observe that since any vector $r$ is generated
with the same probability as $-r$, the test may be
viewed as randomly selecting one of the following 4
inequalities to verify: $f_e(r + \delta\omega) \geq 0$, $f_e(r - \delta\omega) < 0$,
$f_e(-r + \delta\omega) \geq 0$, $f_e(-r - \delta\omega) < 0$. If all four
inequalities hold, then combining $f_e(r + \delta\omega) \geq 0$
with $f_e(-r - \delta\omega) < 0$ we get that $f_1(r + \delta\omega) > 0$,
and combining $f_e(r - \delta\omega) < 0$ with $f_e(-r + \delta\omega) \geq 0$
we get $f_1(r - \delta\omega) < 0$. Consequently, if a degree-2
polynomial $f_e$ passes the test with probability $3/4 + \epsilon$,
then by an averaging argument, for at least an $\epsilon$
fraction of the $r$-outcomes all four of the inequalities
must hold. This implies that for an $\epsilon$ fraction of the
$r$'s we must have $f_1(r + \delta\omega) > 0$ and $f_1(r - \delta\omega) < 0$,
and so the degree-1 PTF $f_1$ must pass the Dictator
Test $T_1$ with probability at least $1/2 + \epsilon$. This
essentially reduces to the problem of testing degree-1
PTFs, whose analysis is sketched above.
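The step that combines pairs of inequalities rests on cancelling the constant and even parts of $f_e$. Written out, using only the decomposition $f_e = \theta + f_1 + f_2$ from the paragraph above:

```latex
% f_1 is odd and f_2 is even, so the constant and quadratic parts cancel.
\begin{align*}
f_e(x) - f_e(-x)
  &= \bigl(\theta + f_1(x) + f_2(x)\bigr) - \bigl(\theta - f_1(x) + f_2(x)\bigr)
   = 2 f_1(x), \\
x = r + \delta\omega:\quad
  f_e(r + \delta\omega) \ge 0,\ f_e(-r - \delta\omega) < 0
  &\implies f_1(r + \delta\omega) > 0, \\
x = r - \delta\omega:\quad
  f_e(r - \delta\omega) < 0,\ f_e(-r + \delta\omega) \ge 0
  &\implies f_1(r - \delta\omega) < 0.
\end{align*}
```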

To get the soundness down to 1/2 more work has
to be done. Roughly speaking, we modify the test
by checking that $\mathrm{sign}(f(k_1 r + k_2 \delta\omega)) = \mathrm{sign}(k_2)$ for
$k_1, k_2$ generated from a carefully constructed distribution
in which $k_1, k_2$ can assume many different possible
orders of magnitude. Using these many different
possibilities for the magnitudes of $k_1, k_2$, a careful
analysis (based on carefully combining inequalities
in a way that is similar to the previous paragraph,
though significantly more complicated) shows that if
a polynomial passes the test with probability $1/2 + \epsilon$
then it can be "decoded" to a small set of coordinates.
In addition to this modification, to avoid
using the Unique Games Conjecture we employ the
"folding trick" that is proposed in [9, 19] to ensure
consistency across different vertices. One benefit of
using this trick is that with it, we only need to design
a test on one vertex instead of an edge.$^2$ The
complete proof of Theorem 1.2 appears in Section 5.

4 Hardness of proper learning noisy degree-d
PTFs: Proof of Theorem 1.1

4.1 Dictator Test Let $f : \mathbb{R}^{2n} \to \mathbb{R}$ be a
$2n$-variable degree-$d$ polynomial over the reals. The key
gadget in our UG-hardness reduction is a dictator
test of whether $f$ is of the form $\mathrm{sign}(x_i - x_{n+i}^d)$ for
some $i \in [n]$. More concretely, our dictator test
queries the value of $f$ on a single point $y \in \mathbb{R}^{2n}$
and decides to accept or reject based on the value
$\mathrm{sign}(f(y))$.

$^2$ The reason that we can not use "folding" for our first result
on low-degree PTFs, roughly speaking, is that such a folding
does not seem able to handle cross-terms of degree greater than
2.



$T_d$: Matching Dictator Test for degree-d
PTFs

Input: A degree-$d$ real polynomial $f : \mathbb{R}^{2n} \to \mathbb{R}$.
Set $\beta := 1/\log n$ and $\delta$ := 2^ .

1. Generate $n$ i.i.d. bits $a_i \in \{0, 1\}$ with $\Pr[a_i =
1] = \beta$, $i \in [n]$. Generate $2n$ i.i.d. $N(0, 1)$
Gaussians $\{h_i, g_i\}_{i=1}^{n}$. Generate a uniform
random bit $b \in \{-1, 1\}$.

2. Set $y = (y_i)_{i=1}^{2n}$ where $y_i = a_i h_i + g_i^d + b\delta$ and
$y_{n+i} = g_i$, $i \in [n]$.

3. Accept iff $\mathrm{sign}(f(y)) = b$.

We can now state and prove the properties of our 
test. The completeness is straightforward. 

Lemma 4.1 (Completeness). The polynomial
$f(x) = x_i - x_{n+i}^d$ passes the test with probability at
least $1 - \beta$.

Proof. Note that $f(y) = a_i h_i + b\delta$. Hence if $a_i = 0$
we have $\mathrm{sign}(f(y)) = b$ and this happens with
probability $1 - \beta$. □
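For reference, the one-line computation behind this proof, obtained by substituting the definition of $y$ from step 2 of the test, is:

```latex
\begin{align*}
f(y) &= y_i - y_{n+i}^{d}
      = \bigl(a_i h_i + g_i^{d} + b\delta\bigr) - g_i^{d}
      = a_i h_i + b\delta, \\
a_i = 0 &\implies f(y) = b\delta
         \implies \mathrm{sign}(f(y)) = b \quad (\text{since } \delta > 0), \\
\Pr[a_i = 0] &= 1 - \beta .
\end{align*}
```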

To state the soundness lemma we need some
more notation. For a degree-$d$ polynomial $f(x) =
\sum_{S \subseteq [n], |S| \leq d} c_S \cdot \chi_S(x)$ we denote $\mathrm{wt}(f) = \sum_{S \neq \emptyset} |c_S|$.
For $\theta > 0$, we define $I_\theta(f) := \{i \in [n] \mid \exists S \ni
i \text{ s.t. } |c_S| \geq \theta \cdot \mathrm{wt}(f)/\binom{n+d}{d}\}$. Note that for $\theta \in
[0, 1]$ we have that $I_\theta(f) \neq \emptyset$, since there are at most
$\binom{n+d}{d}$ nonempty monomials of degree at most $d$ over
$x_1, \ldots, x_n$.
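The two quantities just defined are easy to compute from a coefficient table. In the sketch below a polynomial is represented as a dict mapping sorted index tuples (multisets, 1-based) to coefficients; this representation and the function names are assumptions made for illustration.

```python
from math import comb

def weight(coeffs):
    """wt(f): total absolute weight on the nonconstant monomials."""
    return sum(abs(c) for S, c in coeffs.items() if len(S) > 0)

def influential_coordinates(coeffs, n, d, theta):
    """I_theta(f): coordinates appearing in some monomial S with
    |c_S| >= theta * wt(f) / C(n + d, d)."""
    cutoff = theta * weight(coeffs) / comb(n + d, d)
    return {i for S, c in coeffs.items()
            if len(S) > 0 and abs(c) >= cutoff
            for i in S}

# f(x) = 5 + x_1 + 0.5 * x_2^2 + 0.01 * x_1 x_3, with n = 3 and d = 2
f = {(): 5.0, (1,): 1.0, (2, 2): 0.5, (1, 3): 0.01}
```

With $\theta = 1$ the cutoff is $1.51/10$, so only coordinates 1 and 2 qualify; lowering $\theta$ to 0.05 also admits coordinate 3.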

Let $f : \mathbb{R}^{2n} \to \mathbb{R}$ be a $2n$-variable polynomial
$f(x) = \sum_{S \subseteq [2n], |S| \leq d} c_S \cdot \chi_S(x)$ fed as input to our
test. We will consider the restrictions obtained from
$f$ by setting the first (resp. second) half of the
variables to 0. In particular, for $x = (x_1, \ldots, x_{2n})$ we
shall denote $f_1(x_1, \ldots, x_n) = f(x_1, \ldots, x_n, 0^n)$ and
$f_2(x_{n+1}, \ldots, x_{2n}) = f(0^n, x_{n+1}, \ldots, x_{2n})$.

We are now ready to state our soundness lemma. 
The proof of this lemma poses significant complica- 
tions and constitutes the bulk of the analysis in this 
section. 

Lemma 4.2 (Soundness). Suppose that $f(x) =
\sum_{S \subseteq [2n], |S| \leq d} c_S \cdot \chi_S(x)$ passes the test with probability
at least $1/2 + \beta$. Then for $f_1, f_2$ as defined above, we
have $|I_{0.5}(f_1)| \leq 1/\beta^2$, $|I_1(f_2)| \leq 1/\beta^2$. In addition,
every $i \in [n]$ such that $n + i \in I_1(f_2)$ also satisfies
$i \in I_{0.5}(f_1)$.

Proof. We can assume that $\mathrm{wt}(f) > 0$, since otherwise
$f$ is a constant function, hence passes the test
with probability exactly $1/2$. Since our test is invariant
under scaling, we can further assume that $\mathrm{wt}(f) = 1$.

Let $x \in \mathbb{R}^{2n}$. By definition, $f_1(x) = \sum_{S \subseteq [n]} c_S \cdot
\chi_S(x)$ and $f_2(x) = \sum_{S \subseteq [n+1, 2n]} c_S \cdot \chi_S(x)$. We can
write

$$f(x) = f_1(x) + f_2(x) + f_{12}(x)$$

where $f_{12}(x) = \sum_{S \subseteq [2n],\, S \cap [n] \neq \emptyset,\, S \cap [n+1, 2n] \neq \emptyset} c_S \cdot
\chi_S(x)$.

Let us start by giving a very brief overview
of the argument. The proof proceeds by carefully
analyzing the structure of the coefficients $c_S$ for the
subfunctions $f_1, f_2, f_{12}$. In particular, we show that
the total weight of the cross terms (i.e. $\mathrm{wt}(f_{12})$) is
negligible, and that the weight of $f$ is roughly equally
spread among $f_1$ and $f_2$. Moreover, the coefficients of
$f_1, f_2$ are either themselves negligible or "matching"
(see inequalities (i)-(iv) below). Once these facts have
been established, it is not hard to complete the proof.

The main step towards achieving this goal is to
relate the coefficients $c_S$ with the coefficients of an
appropriately chosen restriction of $f$, obtained by
carefully choosing an appropriate value of $a \in \{0, 1\}^n$.
We start with the following crucial claim:

Claim 4.3. Suppose $f$ passes the test with probability
at least $1/2 + \beta$. Then there exists $a' \in \{0, 1\}^n$ such
that

$$\|f_{a'}\|_2 \leq 2^{-n} \cdot \log^{d^2} n.$$



Proof of Claim 4.3. Let us start by giving an equivalent
description of the test. Denote $\omega = (1^n, 0^n) \in
\mathbb{R}^{2n}$, $r = (r_i)_{i=1}^{2n}$ with $r_i = a_i h_i + g_i^d$ and $r_{n+i} = g_i$,
$i \in [n]$. Note that $y = r + (b\delta)\omega$. Then the Dictator
Test $T_d$ is as follows:

• Generate $r$, and with probability 1/2, test
whether $f(r + \delta\omega) \geq 0$; otherwise test $f(r - \delta\omega) <
0$.

Hence, since $f$ passes with probability $1/2 + \beta$, with
probability at least $2\beta$ over the choice of $r$, the
following inequalities are simultaneously satisfied:

$$f(r + \delta\omega) \geq 0; \quad f(r - \delta\omega) < 0.$$



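We now upper bound $|f(r + \delta\omega) - f(r)|$:

$$|f(r + \delta\omega) - f(r)| = \Big| \sum_{1 \leq |S| \leq d} c_S \Big( \prod_{i \in S \cap [n]} (r_i + \delta) - \prod_{i \in S \cap [n]} r_i \Big) \prod_{i \in S \cap [n+1, 2n]} r_i \Big|$$

$$\leq \sum_{1 \leq |S| \leq d} |c_S| \cdot \Big( \sum_{T \subsetneq S \cap [n]} \delta^{|S \cap [n]| - |T|} \prod_{i \in T} |r_i| \Big) \prod_{i \in S \cap [n+1, 2n]} |r_i|$$

$$\leq \sum_{1 \leq |S| \leq d} |c_S| \cdot 2^{|S|} \cdot \delta \cdot \prod_{i \in S : |r_i| > 1} |r_i|.$$

The last inequality follows from the fact that
there are at most $2^{|S|}$ terms in the second summation,
each bounded from above by $\delta \cdot \prod_{i \in S : |r_i| > 1} |r_i|$.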

We now claim that with probability at least $1 -
n^{-1}$ over the choice of $r$ it holds $M := \max_{i \in [2n]} |r_i| \leq
\log^d n$. To see this note that if $\max_{i \in [n]} \{|g_i|, |h_i|\} \leq c$
then $M \leq 2c^d$. Now recall that for $g \sim N(0, 1)$ and
$c > 2$ we have $\Pr[|g| > c] \leq e^{-c^2/2}$. The claim follows
by fixing $c = \Theta(\log^{1/2} n)$ and taking a union bound
over the corresponding $2n$ events.
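The tail bound and the union bound can be checked numerically. The sketch below uses the exact Gaussian tail via `math.erfc` and picks the concrete value $c = \sqrt{2 \ln(2n^2)}$, one choice consistent with $c = \Theta(\log^{1/2} n)$; that specific constant and the function names are assumptions of this illustration.

```python
from math import erfc, exp, log, sqrt

def gaussian_tail(c):
    """Exact Pr[|g| > c] for a standard Gaussian g."""
    return erfc(c / sqrt(2.0))

def union_bound_failure(n):
    """Pick c with exp(-c^2 / 2) = 1 / (2 n^2) and bound the failure
    probability over the 2n Gaussians g_1..g_n, h_1..h_n by a union bound."""
    c = sqrt(2.0 * log(2.0 * n * n))
    return c, 2 * n * exp(-c * c / 2.0)

results = {n: union_bound_failure(n) for n in (10, 100, 10_000)}
```

For each listed $n$ the resulting bound on the failure probability equals $1/n$ up to rounding, which matches the $1 - n^{-1}$ guarantee stated above.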

Therefore, with probability $1 - n^{-1}$ over the
choice of $r$, we have

$$|f(r + \delta\omega) - f(r)| \leq \delta \cdot 2^d \cdot (\log n)^{d^2} \cdot \mathrm{wt}(f) \leq 2^{-n}.$$

Analogously we obtain that $|f(r) - f(r - \delta\omega)| \leq 2^{-n}$.
We conclude that with probability $2\beta - n^{-1} > \beta$ over
$r$

(4.1) $|f(r)| \leq 2^{-n}$.



Recall that $r$ is a random vector that depends on
$a, g, h$. For every realization of $a \in \{0, 1\}^n$, we
denote the corresponding restriction of $f$ as $f_a(g, h)$;
note that $f_a(g, h)$ is a degree-$d^2$ real polynomial over
Gaussian random variables. Let us denote $\|f_a\|_2 :=
\mathbf{E}_{g,h}[f_a(g, h)^2]^{1/2}$.

At this point we appeal to an analytic fact
from [5]: low degree polynomials over independent
Gaussian inputs have good anti-concentration. In
particular, an application of Theorem A.2 for $f_a(g, h)$
yields that for all $a \in \{0, 1\}^n$ it holds

$$\Pr_{g,h}[|f_a(g, h)| \leq 2^{-n}] \leq d^2 \cdot (2^{-n}/\|f_a\|_2)^{1/d^2}.$$

Combined with (4.1) this gives

$$\beta \leq \Pr_{a,g,h}[|f_a(g, h)| \leq 2^{-n}] \leq \mathbf{E}_a\big[d^2 \cdot (2^{-n}/\|f_a\|_2)^{1/d^2}\big].$$

Now let us fix $a' := \arg\min_{a \in \{0,1\}^n} \|f_a\|_2$; the above
relation implies $(2^{-n}/\|f_{a'}\|_2)^{1/d^2} \geq \beta$ or $\|f_{a'}\|_2 \leq
2^{-n}(1/\beta)^{d^2}$ as desired. This completes the proof of
Claim 4.3. □



Since $a'$ is fixed, we can express $f_{a'}$ as a degree-$d^2$
polynomial over the $g_i$'s and $h_i$'s. Let us write

$$f_{a'} = \sum_{T, T'} w_{T,T'} \cdot \prod_{i \in T} g_i \cdot \prod_{i \in T'} h_i$$

where $T, T' \subseteq [n]$ are multi-sets satisfying $|T| + |T'| \leq
d^2$ and $w_{T,T'} = w_{T,T'}(a')$. Since $f_{a'}$ has small
variance, intuitively each of its coefficients should also
be small. The following simple fact establishes such
a relationship:

Fact 4.4. Let $f : \mathbb{R}^l \to \mathbb{R}$ be a degree-$d$ polynomial
$f(x) = \sum_{|S| \leq d} c_S \cdot \chi_S(x)$ and $G \sim N(0, 1)^l$. For all
$T \subseteq [l]$ we have $\|f(G)\|_2 \geq d^{-d} \cdot |c_T| / \binom{l+d}{d}$.

Proof of Fact \4-4\ The fact follows by expressing 
/ in an appropriate orthonormal basis. Let 
{Hs}sc[i] ,\s\^d be the set of Hermite polynomials of 
degree at most d over I variables, let and f(x) — 
J2\s\^d f{S)Hs(x) be the Hermite expansion of /. 
Then, = £/(5) 2 which clearly implies that 

||/(a)|| 2 >max s |/(5)|. 

Fix an 5 C [I] with |5| ^ d. By basic properties 
of the Hermite polynomials (see e.g. [13]) we have 
that H s {x) = J2ucs h s ■ Xu(x) with \h%\ < d d . 
Hence, for a fixed T C [Z], ct can be written as 
Ssdt h-sf(S). Since 5 C [I] and |5| < d, there are 
at most terms in the summation. Therefore, 

it must be the case that there exists some 5 such 
that |/(5)| ^ d- d ■ |c T |/(^ d ). This completes the 
proof. □ 
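The first step of the proof of Fact 4.4 (the 2-norm equals the Euclidean norm of the Hermite coefficients, hence dominates the largest one) can be checked by hand in one variable. The sketch below is only an illustration under stated assumptions: it restricts to a univariate degree-2 polynomial, uses the orthonormal basis $h_0 = 1$, $h_1 = x$, $h_2 = (x^2-1)/\sqrt{2}$, and the helper names are hypothetical.

```python
import math

def gaussian_moment(k):
    """E[x^k] for x ~ N(0,1): 0 for odd k, (k-1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    return float(math.prod(range(1, k, 2)))  # empty product is 1 when k == 0

def l2_norm_squared(coeffs):
    """E[f(x)^2] for f(x) = sum_j coeffs[j] * x^j with x ~ N(0,1)."""
    return sum(ci * cj * gaussian_moment(i + j)
               for i, ci in enumerate(coeffs)
               for j, cj in enumerate(coeffs))

def hermite_coefficients_deg2(c0, c1, c2):
    """Coefficients of c0 + c1*x + c2*x^2 in the orthonormal basis
    h0 = 1, h1 = x, h2 = (x^2 - 1)/sqrt(2)."""
    return [c0 + c2, c1, math.sqrt(2) * c2]

coeffs = [0.5, -2.0, 3.0]
norm_sq = l2_norm_squared(coeffs)
hermite = hermite_coefficients_deg2(*coeffs)
print(norm_sq, sum(h * h for h in hermite))
```

Both printed quantities agree, and the square root of either is at least the largest Hermite coefficient in absolute value, which is the inequality the proof uses.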

Notation: For the remainder of this proof we will be interested in the coefficients $w_{T,T'}$ for $T' = \emptyset$. For notational convenience we shall denote $w_T := w_{T,\emptyset}$. We now claim that for all $T$ we have

(4.2)  $|w_T| \le n^{-10d}$.

Using Fact 4.4, if this were not the case we would get a contradiction with Claim 4.3.

At this point we establish the relationship between the $w_T$'s and the coefficients $c_S$ of $f$ in our original basis $\{\chi_S\}$.

By definition, the restriction obtained from $f_{a'}(g,h)$ by setting the $h_i$ variables to $0$ is identical to the function $f(g_1^d, \ldots, g_n^d, g_1, \ldots, g_n)$. Therefore we have

(4.3)  $\sum_{T \subseteq [n]} w_T \cdot \prod_{i \in T} g_i = \sum_{S \subseteq [2n]} c_S \cdot \prod_{i \in S \cap [n]} g_i^d \cdot \prod_{(n+i) \in S} g_i$.

For any fixed $T$ in the LHS of (4.3) there is an equivalence class of sets $S$ in the RHS such that the monomial $\prod_{i \in S \cap [n]} g_i^d \cdot \prod_{(n+i) \in S} g_i$ equals $\prod_{i \in T} g_i$. It is clear that $w_T$ equals $\sum_S c_S$, where the sum is over all $S$ in the equivalence class. In fact, the structure of the equivalence classes is quite simple, as established by the following claim:



Claim 4.5. For any $S_0 \ne S_1 \subseteq [2n]$ of size at most $d$, if

(4.4)  $\prod_{i \in S_0 \cap [n]} g_i^d \cdot \prod_{n+j \in S_0,\, j \in [n]} g_j = \prod_{i \in S_1 \cap [n]} g_i^d \cdot \prod_{n+j \in S_1,\, j \in [n]} g_j$,

then there exists some $\ell \in [n]$ such that $S_0 = \{\ell\}$ and $S_1 = \{n+\ell : d\}$ or vice versa.

Proof of Claim 4.5. Consider the following two complementary cases.

• $S_0 \cap [n] \ne S_1 \cap [n]$. Without loss of generality, we can assume that there is some $\ell \in S_0 \cap [n]$ with $\ell \notin S_1$. (Otherwise the roles of $S_0, S_1$ can be reversed.) Then to make (4.4) hold, it must be the case that $S_1$ contains $d$ copies of $n + \ell$. Now, since $|S_1| \le d$, it can only be the case that $S_1 = \{n+\ell : d\}$, which implies that $S_0 = \{\ell\}$.

• $S_0 \cap [n+1, 2n] \ne S_1 \cap [n+1, 2n]$. We may assume that there is some $\ell \in [n]$ such that $(n+\ell) \in S_0$ and $(n+\ell) \notin S_1$. Then, for (4.4) to hold, it must be the case that $\ell \in S_1$. Hence, it must be the case that $S_0 = \{n+\ell : d\}$ (since $g_\ell$ is raised to the $d$th power in the RHS of (4.4)); this in turn enforces $S_1 = \{\ell\}$. □
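Claim 4.5 can be sanity-checked by brute force for small parameters. The sketch below is an illustration only, under the assumption used above that coordinate $i \le n$ is substituted by $g_i^d$ and coordinate $n+i$ by $g_i$; the function name `colliding_classes` is hypothetical. It groups all multisets $S \subseteq [2n]$ with $|S| \le d$ by the monomial in $g$ they induce and returns the classes containing more than one multiset.

```python
from itertools import combinations_with_replacement
from collections import defaultdict

def colliding_classes(n, d):
    """Group multisets S of [2n] with |S| <= d by their induced monomial in g
    (x_i -> g_i^d for i <= n, x_{n+i} -> g_i); return classes of size > 1."""
    classes = defaultdict(set)
    for size in range(d + 1):
        for S in combinations_with_replacement(range(1, 2 * n + 1), size):
            exponents = [0] * n
            for idx in S:
                if idx <= n:
                    exponents[idx - 1] += d        # x_idx contributes g_idx^d
                else:
                    exponents[idx - n - 1] += 1    # x_{n+i} contributes g_i
            classes[tuple(exponents)].add(S)
    return {frozenset(c) for c in classes.values() if len(c) > 1}

# For n = 3, d = 3 the only collisions are {l} versus {n+l : d}.
for cls in sorted(colliding_classes(3, 3), key=sorted):
    print(sorted(cls))
```

Multisets are represented as sorted tuples, so `(4, 4, 4)` stands for $\{n+1 : 3\}$ when $n = 3$.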

Claim 4.5 implies the following relation between the coefficients $c_S$ and $w_T$:

(A) If $T = \{i : d\}$, for some $i \in [n]$, then we have $w_T = c_{S_1} + c_{S_2}$ with $S_1 = \{i\}$ and $S_2 = \{n+i : d\}$.

(B) If $T$ is not of the above form, then there exists a multi-set $S \subseteq [2n]$, $|S| \le d$, where $S \ne \{i\}$ and $S \ne \{n+i : d\}$ for any $i \in [n]$, such that $T$ equals $\{i : d \mid i \in S\} \cup \{i \mid n+i \in S\}$. In this case, we have $w_T = c_S$.

We are now ready to establish the desired bounds on the coefficients of the subfunctions $f_1, f_2, f_{12}$.

(i) For all $S \subseteq [n]$ with $|S| \ge 2$, (4.2) and (B) yield $|c_S| \le n^{-10d}$.

(ii) For all $S \subseteq [n+1, 2n]$ with $S \ne \{n+i : d\}$ for any $i \in [n]$, (4.2) and (B) yield $|c_S| \le n^{-10d}$.

(iii) For all $i \in [n]$, by (4.2) and (A) we obtain

$\big| |c_{\{i\}}| - |c_{\{n+i:d\}}| \big| \le |c_{\{i\}} + c_{\{n+i:d\}}| \le n^{-10d}$.

(iv) For all $S$ such that $S \cap [n] \ne \emptyset$ and $S \cap [n+1, 2n] \ne \emptyset$, (4.2) and (B) yield $|c_S| \le n^{-10d}$.

Since the coefficients of $f_1, f_2$ are either very small (cases (i), (ii) above) or matching (case (iii)), we get $|\mathrm{wt}(f_1) - \mathrm{wt}(f_2)| \le n^{-10d} \cdot \binom{n+d}{d} \le n^{-1}$. Moreover, since every coefficient of $f_{12}$ is small (case (iv)), we deduce that $\mathrm{wt}(f_{12}) \le n^{-10d} \cdot \binom{2n+d}{d} \le n^{-1}$. Recalling that $\mathrm{wt}(f_1) + \mathrm{wt}(f_2) + \mathrm{wt}(f_{12}) = \mathrm{wt}(f) = 1$, we get $\mathrm{wt}(f_1) + \mathrm{wt}(f_2) \ge 1 - \frac{1}{n}$. Combining these bounds, we get that

(4.5)  $0.51 \ge \mathrm{wt}(f_1), \mathrm{wt}(f_2) \ge 0.49$.

Now fix an $i \in [n]$ with $(n+i) \in I_1(f_2)$. The above inequality implies that there must exist some $S \ni (n+i)$ such that $|c_S| \ge 0.49/\binom{n+d}{d}$. By (ii), we deduce that it can only be the case that $S$ equals $\{n+i : d\}$ (as all other coefficients in $f_2$ are very small).

Moreover, (iii) implies that $|c_{\{i\}}| \ge 0.48 \cdot \binom{n+d}{d}^{-1}$, hence $i \in I_{0.5}(f_1)$ (recalling that $\mathrm{wt}(f_1) \le 0.51$). So we have $|I_1(f_2)| \le |I_{0.5}(f_1)|$ and it remains to bound from above the size of $I_{0.5}(f_1)$ by $\beta^{-2}$.

Suppose (for the sake of contradiction) that $|I_{0.5}(f_1)| > \beta^{-2}$. Since $\mathrm{wt}(f_1) \ge 0.49$, every $j \in I_{0.5}(f_1)$ comes from the set $S = \{j\}$ (as all the other coefficients of $f_1$ are too small). Consider all possible realizations of $a \in \{0,1\}^n$. With probability $1 - (1-\beta)^{|I_{0.5}(f_1)|} \ge 1 - n^{-1}$ over the choice of $a$, there exists $i \in I_{0.5}(f_1)$ with $a_i = 1$. Fix such an $i$. By the definition of $I_{0.5}(f_1)$, we must have $|c_{\{i\}}| \ge 0.5 \cdot 0.49 \cdot \binom{n+d}{d}^{-1} \ge 0.2 \cdot \binom{n+d}{d}^{-1}$. Hence, there will be a degree-1 monomial in the expansion of $f_a$ as a polynomial over $g$ and $h$ whose coefficient has absolute value at least $0.2 \cdot \binom{n+d}{d}^{-1}$.

The aforementioned and Fact 4.4 imply a lower bound on $\|f_a\|_2$ that holds with probability $1 - n^{-1}$ over $a$.

By Theorem A.2 and the fact that $\mathrm{wt}(f) = 1$ we get that $\Pr_{a,g,h}[|f_a(g,h)| \le 2^{-n}]$ is at most $n^{-1} + O(d^2 \cdot n^2 \cdot 2^{-n/d^2}) = o(\beta)$, which contradicts (4.1). This completes the proof of Lemma 4.2. □

4.2 Hardness reduction from Unique Games 

With the completeness and soundness lemmas in place, we are ready to prove Theorem 1.1. The hardness reduction is from a Unique Games instance $\mathcal{L}(U, V, E, \Pi, k)$ to a distribution of positive and negative examples. The examples lie in $\mathbb{R}^{(|U|+|V|)k}$ and are labeled with either $(+1)$ or $(-1)$. Denote $\dim = (|U| + |V|)k$.

For $w \in U \cup V$ and $x \in \mathbb{R}^{\dim}$, we use $x_w^{(i)}$ to denote the coordinate corresponding to the vertex $w$'s $i$-th label. We use $x_w$ to indicate the collection of coordinates corresponding to vertex $w$; i.e., $(x_w^{(1)}, x_w^{(2)}, \ldots, x_w^{(k)})$. For a function $f(x) : \mathbb{R}^{\dim} \to \mathbb{R}$, we use $f_u$ to denote the restriction of $f$ obtained by setting all the coordinates except $x_u$ to $0$. Similarly, $f_{u,v}$ denotes the restriction of $f$ obtained by setting all the coordinates except $x_u, x_v$ to $0$.

In the reduction that follows, starting from an instance $\mathcal{L}$ of Unique Games, we construct a distribution $\mathcal{D}$ over labeled examples. Let us denote by $\mathrm{Opt}(\mathcal{D})$ the agreement of the best degree-$d$ PTF on $\mathcal{D}$; our constructed distribution has the following properties:

• If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then $\mathrm{Opt}(\mathcal{D}) \ge 1 - \eta - \frac{1}{\log k}$; and

• If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, then $\mathrm{Opt}(\mathcal{D}) \le \frac{1}{2} + \frac{2}{\log k}$.

This immediately yields the desired hardness result. We now describe and analyze our reduction.





Reduction from Unique Games

Input: Unique Games instance $\mathcal{L}(U, V, E, \Pi, k)$.


Set P = ^ and S = 2^ . 


1. Randomly choose an edge $(u, v) \in E$.

2. Set $y_w = 0$ for any $w \in U \cup V$ such that $w \ne u$, $w \ne v$.

3. Generate $k$ i.i.d. bits $a_i \in \{0,1\}$ with $\Pr[a_i = 1] = \beta$, $2k$ independent standard Gaussians $\{h_i, g_i\}_{i=1}^{k}$ and a uniform random sign $b \in \{-1, 1\}$.


4. For all $i \in [k]$, set $y_v^{(i)} := g_i$ and $y_u^{(i)} := a_i h_i + (g_{\pi_e(i)})^d + \delta b$.


5. Output the labeled example $(y, b)$.



Lemma 4.6 (Completeness). If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then there is a degree-$d$ PTF that is consistent with a $1 - \eta - \beta$ fraction of the examples.

Proof. Suppose that there is a labeling $L$ that satisfies a $1 - \eta$ fraction of the edges. Then it is easy to verify that the degree-$d$ PTF

$\mathrm{sign}\left(\sum_{u \in U} y_u^{(L(u))} - \sum_{v \in V} \big(y_v^{(L(v))}\big)^d\right)$

agrees with a $1 - \eta - \beta$ fraction of the examples. □

Lemma 4.7 (Soundness). If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, then no degree-$d$ PTF agrees with more than a $1/2 + 2\beta$ fraction of the examples.



Proof. Suppose (for the sake of contradiction) that some degree-$d$ polynomial $f$ satisfies more than a $1/2 + 2\beta$ fraction of the examples. Then by an averaging argument, for a $\beta$ fraction of the edges $(u, v)$ picked in the first step, we have that $f(x)$ agrees with the labeled example $(y, b)$ with probability at least $1/2 + \beta$. Let us call these edges "good".

Fix a "good" edge $e = (u, v)$ and let us assume for notational convenience that $\pi_e$ is the identity mapping. Essentially, we are conducting the test $T_d$ for the restriction $f_{u,v}$ with parameter $n := k$. Since $f_{u,v}$ passes the test with probability $1/2 + \beta$, Lemma 4.2 implies that we must have that $I_{0.5}(f_u), I_1(f_v) \ne \emptyset$ and $|I_1(f_v)|, |I_{0.5}(f_u)| \le 1/\beta^2$.

We are now ready to give our randomized labeling strategy (based on $f$). For every $u \in U$, randomly pick its label from $I_{0.5}(f_u)$ and for every $v \in V$ randomly pick its label from $I_1(f_v)$. It is clear that each good edge is satisfied with probability at least $\beta^2$. Since at least a $\beta$ fraction of the edges is good, such a labeling satisfies at least a $\beta^3 = 1/(\log k)^3$ fraction of the edges in expectation. Hence, there exists a labeling that satisfies such a fraction of the edges, which contradicts the assumption that $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, for $k$ sufficiently large. □

4.3 A technical point: Discretizing the Gaussian Distribution

Lemmas 4.6 and 4.7 do not quite suffice to prove Theorem 1.1 because the reduction described above is not computable in polynomial time. This is because the distribution $\mathcal{D}$ has infinite support; recall that for each edge $e$, sampling from the corresponding distribution $\mathcal{D}_e$ requires generating $2k$ independent Gaussian random variables $h = (h_1, \ldots, h_k)$, $g = (g_1, \ldots, g_k)$.

To discretize the reduction we replace $h$ by $h'$ and $g$ by $g'$, where each of the $2k$ random variables $h'_i, g'_i$ is independently generated as a sum of $N$ uniform $\{-1, 1\}$ bits divided by $\sqrt{N}$. In Theorem 4.9 of Section 4.3.1 we argue that for sufficiently large $N$ (in particular any $N \ge (2k)^{24d^4}$ suffices), there is a way to couple the distribution of $(g, h)$ with that of $(g', h')$ such that every degree-$d^2$ polynomial takes the same sign on $(g, h)$ as on $(g', h')$ except with probability at most $1/k$. Since every outcome of $a \in \{0,1\}^k$ results in the polynomial $f_a(g, h)$ being a degree-$d^2$ polynomial, if we replace $(g, h)$ with $(g', h')$ in the reduction then the discretized reduction will almost preserve the soundness and completeness guarantees of Section 4.2, with only a loss of $1/k$: writing $\mathcal{D}'$ for the discretized distribution, we have

• If $\mathrm{Opt}(\mathcal{L}) \ge 1 - \eta$, then $\mathrm{Opt}(\mathcal{D}') \ge 1 - \eta - \frac{1}{\log k} - 1/k$; and

• If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, then $\mathrm{Opt}(\mathcal{D}') \le \frac{1}{2} + \frac{2}{\log k} + 1/k$.

Finally, we observe that the distribution of $(g', h')$ has support of size $(N+1)^{2k} \le (2N)^{2k} \le (4k)^{48d^4k}$; since the label size $k$ is regarded as constant in a Unique Games instance, this is a (large) constant for constant $d$. Thus it is possible to simply enumerate the entire support of $\mathcal{D}'$ in polynomial time (since there are $|E|$ distributions $\mathcal{D}'_e$, the overall size of the support of $\mathcal{D}'$ is polynomial in the size of the Unique Games instance) and consequently there is no need for randomness - the entire overall reduction is
deterministic. Theorem 1.1 now follows by choosing appropriate settings of $\eta$ and $k$ (e.g., $\eta = \varepsilon/2$ and $k = e^{1/\varepsilon^2}$ suffice).

Finally, we note that the above remarks imply that Theorem 1.1 holds not only for constant $d$, but for $d$ as large as $O((\log n)^{1/4})$ - since $k$ is constant, for such $d$ the support size $(4k)^{48d^4k}$ is still polynomial in $n$.
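The support-size count above is easy to see concretely for toy parameters. The following sketch is illustrative only (the function names are hypothetical, and the values of $N$ and $k$ are far smaller than the proof requires): it enumerates the support of one discretized Gaussian, i.e. a sum of $N$ uniform $\pm 1$ bits divided by $\sqrt{N}$, and then the joint support of $2k$ independent copies.

```python
import math
from itertools import product

def discretized_gaussian(N):
    """Support points and probabilities of (b_1 + ... + b_N)/sqrt(N),
    with each b_i uniform in {-1, 1}."""
    return [((2 * ones - N) / math.sqrt(N), math.comb(N, ones) / 2 ** N)
            for ones in range(N + 1)]

def joint_support(N, k):
    """Yield every outcome of 2k independent copies (the discretized g', h')
    together with its probability."""
    single = discretized_gaussian(N)
    for outcome in product(single, repeat=2 * k):
        values = tuple(v for v, _ in outcome)
        prob = math.prod(p for _, p in outcome)
        yield values, prob

N, k = 3, 2
outcomes = list(joint_support(N, k))
print(len(outcomes), (N + 1) ** (2 * k))
```

The enumeration has exactly $(N+1)^{2k}$ outcomes whose probabilities sum to 1, which is what allows the randomized reduction to be replaced by a deterministic listing of weighted examples.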

4.3.1 Discretizing the Gaussian distribution 

The following theorem shows that there exists a distribution $\mathcal{H}_N/\sqrt{N}$ that is point-wise close to a Gaussian distribution $\mathcal{G}$ with high probability:

Theorem 4.8. There is a probability distribution $(\mathcal{G}, \mathcal{H}_N)$ on $\mathbb{R}^2$ such that the marginal distribution $\mathcal{G}$ of the first coordinate follows the standard $N(0,1)$ Gaussian distribution, and the marginal distribution $\mathcal{H}_N$ of the second coordinate is distributed as a sum of $N$ random bits, i.e., $\mathcal{H}_N = \sum_{i=1}^{N} b_i$ where each $b_i$ is an independent random bit from $\{-1, 1\}$. In addition, $\mathcal{H}_N/\sqrt{N}$ and $\mathcal{G}$ are pointwise close in the following sense:

$\Pr\left[|\mathcal{G} - \mathcal{H}_N/\sqrt{N}| \le O(N^{-1/4})\right] \ge 1 - O(N^{-1/4})$.

Proof. Let $\Phi$ be the CDF (cumulative distribution function) of $\mathcal{H}_N$, and let $\Psi$ be the CDF of $\mathcal{G}$ (the standard Gaussian distribution).

We couple the random variables $\mathcal{G}, \mathcal{H}_N$ in the following way: to obtain a draw $(g_0, h_0)$ from the joint distribution, first we sample $h_0$ from the marginal distribution on $\mathcal{H}_N$. We know that

$\Pr[\mathcal{H}_N = h_0] = \Phi(h_0) - \Phi(h_0 - 2)$,

since if $h_0$ is a feasible outcome of summing $N$ bits then $h_0 - 2$ is the largest feasible outcome that is less than $h_0$ (if any feasible outcome less than $h_0$ exists). Then we generate $g_0$ by drawing random samples from the standard Gaussian distribution until we obtain a sample that lies in the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$; when we obtain such a sample, we set $g_0$ to this value.

It is not difficult to see that the random variable $\mathcal{G}$ defined in this way follows the standard Gaussian distribution; essentially we are using the value of $h_0$ as an indicator of whether $\mathcal{G}$ is in the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$. We also need to check that $\Pr[\mathcal{H}_N = h_0]$ is equal to $\Pr[\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]]$. This is true because

$\Pr[\mathcal{H}_N = h_0] = \Pr[\mathcal{H}_N \in (h_0 - 2, h_0]] = \Phi(h_0) - \Phi(h_0 - 2) = \Pr[\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]]$.

With the above coupling of $\mathcal{G}$ and $\mathcal{H}_N$, it remains to prove that every value in the interval $(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$ is close to $h_0/\sqrt{N}$, with high probability over a random choice of $h_0$ as described above. It suffices to verify that the following two inequalities each hold with probability at least $1 - O(N^{-1/4})$:

$\left|\Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}}\right| \le O(N^{-1/4})$  and  $\left|\Psi^{-1}(\Phi(h_0 - 2)) - \frac{h_0}{\sqrt{N}}\right| \le O(N^{-1/4})$.

We consider the first inequality; the second one is entirely similar. We show that $\Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}} \le O(N^{-1/4})$; the other direction $\Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}} \ge -O(N^{-1/4})$ is similar.

By the Berry-Esseen Theorem (Theorem A.1 in Appendix A), we have that $|\Phi(h_0) - \Psi(h_0/\sqrt{N})| \le \frac{1}{\sqrt{N}}$. Therefore, we have that

(4.6)  $\Psi^{-1}(\Phi(h_0)) \le \Psi^{-1}\left(\Psi(h_0/\sqrt{N}) + \frac{1}{\sqrt{N}}\right) = \frac{h_0}{\sqrt{N}} + E_{h_0}$,

where the "error term" $E_{h_0}$ is the value for which $\Psi(h_0/\sqrt{N} + E_{h_0}) - \Psi(h_0/\sqrt{N}) = 1/\sqrt{N}$.

If $|h_0| \le \sqrt{\frac{N \ln N}{2}}$, then in an interval of width $N^{-1/4}$ around $h_0/\sqrt{N}$ the PDF of the standard Gaussian is everywhere at least $\Omega(N^{-1/4})$; consequently, if $|h_0| \le \sqrt{\frac{N \ln N}{2}}$ then the error term $E_{h_0}$ is at most $O(N^{-1/4})$ as required. A standard Chernoff Bound implies that $\Pr\left[|h_0| > \sqrt{\frac{N \ln N}{2}}\right]$ is at most $O(N^{-1/4})$, and the argument is complete. □
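The coupling in the proof of Theorem 4.8 can be simulated directly. The sketch below is a minimal illustration, with assumptions labeled: instead of rejection sampling it draws a uniform point in $(\Phi(h_0 - 2), \Phi(h_0)]$ and applies $\Psi^{-1}$, which yields the same conditional distribution for $g_0$; the function names are hypothetical; and the clamp away from 0 and 1 is only a numerical safeguard for the inverse CDF.

```python
import math
import random
from statistics import NormalDist

def binomial_cdf_table(N):
    """Phi(h) for h in {-N, -N+2, ..., N}: CDF of a sum of N uniform +-1 bits."""
    total, running, cdf = 2 ** N, 0, {}
    for ones in range(N + 1):              # h = 2*ones - N
        running += math.comb(N, ones)
        cdf[2 * ones - N] = running / total
    return cdf

def coupled_draw(N, cdf, rng):
    """Return (g0, h0): h0 is a sum of N random signs, g0 is N(0,1)
    conditioned to lie in (Psi^{-1}(Phi(h0 - 2)), Psi^{-1}(Phi(h0))]."""
    h0 = 2 * bin(rng.getrandbits(N)).count("1") - N
    lo, hi = cdf.get(h0 - 2, 0.0), cdf[h0]
    u = lo + (hi - lo) * rng.random()      # uniform in the CDF interval
    u = min(max(u, 1e-12), 1 - 1e-12)      # keep inv_cdf well defined
    return NormalDist().inv_cdf(u), h0

rng = random.Random(0)
N = 256
cdf = binomial_cdf_table(N)
gaps = [abs(g - h / math.sqrt(N))
        for g, h in (coupled_draw(N, cdf, rng) for _ in range(2000))]
print(max(gaps), N ** -0.25)
```

Comparing the largest observed gap $|g_0 - h_0/\sqrt{N}|$ with $N^{-1/4}$ gives a feel for the pointwise closeness asserted by the theorem; the constants hidden in the $O(\cdot)$ are not pinned down by this experiment.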



Now we use the joint distribution constructed in Theorem 4.8 to discretize the standard $n$-dimensional Gaussian space for low-degree PTFs.

Theorem 4.9. Fix any constant $D \ge 1$, and let $f(x_1, \ldots, x_n) = \sum_{|S| \le D} c_S \cdot \chi_S(x)$ be a degree-$D$ polynomial over $\mathbb{R}^n$. Let $(y, z) \in \mathbb{R}^n \times \mathbb{R}^n$ be generated by taking each pair $(y_i, z_i)$ to be an i.i.d. draw from the distribution $(\mathcal{G}, \mathcal{H}_N/\sqrt{N})$ of Theorem 4.8, where we take $N = n^{24D^2}$. Then we have

$\Pr[\mathrm{sign}(f(y)) \ne \mathrm{sign}(f(z))] \le O(1/n)$.

Proof. First, we may assume without loss of generality that the polynomial $f$ is normalized so that $\sum_S |c_S|$ equals 1. Since there are at most $\binom{n+D}{D}$ coefficients in $f$, one of these coefficients $c_S$ must satisfy $|c_S| \ge 1/\binom{n+D}{D}$; now Fact 4.4 implies that $\|f\|_2 \ge D^{-D}/\binom{n+D}{D}^2$.

We have

$\Pr[\mathrm{sign}(f(y)) \ne \mathrm{sign}(f(z))] \le \Pr[|f(y)| \le |f(z) - f(y)|]$.

To bound the latter probability by $O(1/n)$, we show that $|f(y)| \ge n^{-3D^2}$ with probability $1 - O(1/n)$, and that $|f(z) - f(y)| < n^{-3D^2}$ with probability $1 - O(1/n)$.

The first desired bound, $\Pr[|f(y)| \le n^{-3D^2}] \le O(1/n)$, is an immediate consequence of Theorem A.2.

For the second, we note that by a union bound and Theorem 4.8, with probability at least $1 - O(n/N^{1/4}) \ge 1 - O(1/n)$ every $i \in [n]$ satisfies $|y_i - z_i| \le O(N^{-1/4})$. Standard Chernoff bounds and Gaussian tail bounds give that the probability any $|y_i|$ or $|z_i|$ exceeds $n^{1/D}$ is much less than $1/n$. Now, similar to the calculation used to bound $|f(r + \delta\omega) - f(r)|$ in the proof of Claim 4.3, when $y$ and $z$ are $O(N^{-1/4})$-close in each coordinate and each coordinate is at most $n^{1/D}$, we have that

$|f(y) - f(z)| \le O(N^{-1/4}) \cdot O(n) \le n^{-3D^2}$.

This concludes the proof. □



5 Hardness of learning noisy halfspaces with degree 2 PTF hypotheses: Proof of Theorem 1.2

Similar to Section 4, the proof has two parts; first (Section 5.1) we construct a dictator test for degree 2 PTFs, and then (Section 5.2) we compose the dictator test with the Label Cover instance to prove NP-hardness.

5.1 The Dictator Test

The key gadget in the hardness reduction is a Dictator Test that is designed to check whether a degree-2 PTF is of the form $\mathrm{sign}(x_i)$ for some $i \in [n]$. Suppose $f$ is a degree 2 polynomial

$f(x) = \theta + f_1(x) + f_2(x)$, where $f_1(x) = \sum_{i} c_i x_i$ and $f_2(x) = \sum_{i,j} c_{ij} x_i x_j$.

Below we give a one-query Dictator Test $T_2$ for $\mathrm{sign}(f(x))$.



We show that $T_2$ has the following completeness and soundness properties.

Lemma 5.1. (Completeness) For $i \in [n]$, the polynomial $f(x) = x_i$ passes $T_2$ with probability at least $1 - \beta$.

Proof. If $f(x) = x_i$ for some $i \in [n]$, then as long as $a_i$ is set to zero in step 1 we have that $f(y) = b\delta t^2$ and $f$ passes the test. By definition of the test, $a_i$ is $0$ with probability $1 - \beta$. □
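The completeness bound can be observed empirically by simulating the test on a dictator. The sketch below is illustrative and rests on labeled assumptions: it takes $\beta = 1/\log n$ with a base-2 logarithm and $\delta = 2^{-n}$ (the text here does not fix the base, and the value of $\delta$ is read from the test description), treats $\mathrm{sign}(0)$ as $+1$, and uses the hypothetical helper name `run_test_T2`.

```python
import math
import random

def run_test_T2(f, n, rng):
    """One run of the dictator test T_2; returns True iff sign(f(y)) == b."""
    log_n = math.log2(n)                       # assumption: base-2 logarithm
    beta, delta = 1.0 / log_n, 2.0 ** (-n)     # assumption: delta = 2^(-n)
    # r_j = a_j * g_j with a_j ~ Bernoulli(beta), g_j ~ N(0, 1)
    r = [rng.gauss(0.0, 1.0) if rng.random() < beta else 0.0 for _ in range(n)]
    i = rng.randint(1, int(log_n) ** 2)        # t = n^i, i in {1, ..., (log n)^2}
    t = float(n) ** i
    b = rng.choice((-1, 1))
    y = [t ** 3 * r_j + b * t ** 2 * delta for r_j in r]   # y = t^3 r + b t^2 delta w
    return (1 if f(y) >= 0 else -1) == b

rng = random.Random(0)
n = 16
dictator = lambda y: y[0]                      # f(x) = x_1
trials = 4000
rate = sum(run_test_T2(dictator, n, rng) for _ in range(trials)) / trials
print(rate, 1 - 1 / math.log2(n))
```

With $n = 16$ the dictator passes whenever $a_1 = 0$ and roughly half the time otherwise, so the empirical rate sits near $1 - \beta/2$, above the $1 - \beta$ guaranteed by Lemma 5.1.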

Lemma 5.2. (Soundness) Let $A$ denote $\sum_{i=1}^{n} c_i$ and let $I(f)$ be the set $\{i \mid c_i \ge A/n^2\}$. If a degree-2 polynomial $f$ passes the test with probability at least $1/2 + \beta$, then $|I(f)| \le 1/\beta^2$ and $A > 0$.

Proof. The proof is by contradiction. Let $f$ be a degree-2 polynomial with $|I(f)| > 1/\beta^2$ or $A \le 0$, and suppose that $f$ passes the test with probability at least $\frac{1}{2} + \beta$.

First we show the following lemma.

Lemma 5.3. $\Pr_r[f_1(r) \in (-\delta A, \delta A)] \le \frac{2}{n}$.

Proof. The inequality obviously holds for $A \le 0$ since the interval is empty. Thus we may assume that $A > 0$ and $|I(f)| \ge 1/\beta^2$. We know that in step 1 when generating the bit-vector $a$, with probability at least $1 - (1-\beta)^{|I(f)|} \ge 1 - \frac{1}{n}$ at least one of the coordinates in $I(f)$ has its bit $a_i$ nonzero. Fix any such outcome for the bit-vector $a$; now considering the random choice of the Gaussians $g_1, \ldots, g_n$, we have that the resulting $f_1(r)$ is a Gaussian variable with variance at least $A^2/n^4$ (as one of the weights is at least $A/n^2$). Using the standard fact that an $N(\mu, \sigma^2)$ Gaussian random variable puts probability mass at most $t/\sigma$ on any interval of length $t$, we have that for such an outcome of the $a$-vector,

$\Pr_g[f_1(r) \in (-\delta A, \delta A)] \le \frac{2\delta A}{A/n^2} = 2\delta n^2 \le \frac{1}{n}$.

Now a union bound gives that for at most $\frac{2}{n}$ of the $r$ generated, $f_1(r)$ is inside the interval $(-\delta A, \delta A)$. □

Now we observe that for any outcome $r$, the vectors $r$ and $-r$ are generated with equal probability. Thus an equivalent test to $T_2$ would be to generate $r, t$ as described by the test and then check a randomly selected one of the following four inequalities:

(5.7)  $f(t^3 r + t^2 \delta\omega) \ge 0$
(5.8)  $f(t^3 r - t^2 \delta\omega) < 0$
(5.9)  $f(-t^3 r + t^2 \delta\omega) \ge 0$
(5.10)  $f(-t^3 r - t^2 \delta\omega) < 0$.

Since $f$ is assumed to pass the test with probability $\frac{1}{2} + \beta$, an averaging argument gives that for a $\beta/2$ fraction of the possible outcomes of $r$, at least a $(\frac{1}{2} + \beta/2)$ fraction of all the constraints involving that $r$ outcome are satisfied. (Note that for any fixed outcome of $r$ there are $4(\log n)^2$ constraints, corresponding to inequalities (5.7)-(5.10) for each of the $(\log n)^2$ possible values of $t$.) For this $\beta/2$ fraction of $r$, let us remove those outcomes $r$ such that $f_1(r) \in (-\delta A, \delta A)$ (recall that this is at most a $2/n$ fraction of all $r$-outcomes). Recalling that $\beta = \frac{1}{\log n}$, we know there are at least a $\beta/4$ fraction of $r$-outcomes remaining; we call these "good" $r$'s.

Let us fix a good $r$. By an averaging argument again, for any "good" $r$, for at least a $\beta/4$ fraction of the possible outcomes of $t$, at least 3 out of the 4 inequalities that contain $t$ and $r$ are satisfied. There are 4 different ways of choosing 3 out of the 4 constraints. Without loss of generality, let us assume that for a $\beta/16$ fraction of the $t$-outcomes, the first, second, and fourth constraints (5.7), (5.8) and (5.10) are satisfied. That is:

(5.11)  $f(t^3 r + t^2 \delta\omega) \ge 0$
(5.12)  $f(t^3 r - t^2 \delta\omega) < 0$
(5.13)  $f(-t^3 r - t^2 \delta\omega) < 0$.



$T_2$: Dictator Test for Degree-2 Polynomials

Input: A degree-2 real polynomial $f : \mathbb{R}^n \to \mathbb{R}$
Fix $\beta := \frac{1}{\log n}$ and $\delta := 2^{-n}$.

1. Generate independent bits $a_1, a_2, \ldots, a_n \in \{0, 1\}$ each with expected value $\beta$. Generate $n$ independent $N(0,1)$ Gaussian variables $g_1, \ldots, g_n$. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_n g_n)$.

2. Generate $t$ by randomly picking a number $i \in \{1, 2, \ldots, (\log n)^2\}$ and set $t = n^i$. Generate a random bit $b \in \{-1, 1\}$.

3. Set $\omega \in \mathbb{R}^n$ to be the all-1s vector $(1, \ldots, 1)$ and set $y = t^3 r + b t^2 \delta\omega$.

4. Accept iff $\mathrm{sign}(f(y)) = b$.



Let us call these $t$ "good" for the corresponding $r$, and let us denote the set that contains all the "good" $t$ for a given "good" $r$ by $T_r$. Since the possible choice of $t = n^i$ ranges over all $i \in [\log^2 n]$, we therefore obtain $|T_r| \ge (\log n)^2 \cdot \beta/16 = \Theta(\log n)$.

Since $f(x)$ is a degree 2 polynomial, we can express $f(r + \delta\omega)$ as:

$f(r + \delta\omega) = \theta + f_1(r) + f_2(r) + \delta \sum_{i} c_i + \delta \sum_{i,j} c_{ij}(r_i + r_j) + \delta^2 \sum_{i,j} c_{ij}$.

Let us denote $B = \sum_{i,j} c_{ij}$ and $f_2'(r) = \sum_{i,j} c_{ij}(r_i + r_j)$. We can rewrite (5.11), (5.12), (5.13) as:

(5.14)  $t^3 f_1(r) + t^2 \delta A + t^6 f_2(r) + t^5 \delta f_2'(r) + t^4 \delta^2 B + \theta \ge 0$
(5.15)  $t^3 f_1(r) - t^2 \delta A + t^6 f_2(r) - t^5 \delta f_2'(r) + t^4 \delta^2 B + \theta < 0$
(5.16)  $t^3 f_1(r) + t^2 \delta A - t^6 f_2(r) - t^5 \delta f_2'(r) - t^4 \delta^2 B - \theta > 0$

Notice that (5.14) and (5.16) yield

$f_1(r) \ge -\delta A/t + |t^3 f_2(r) + \delta t^2 f_2'(r) + \delta^2 t B + \theta/t^3|$.

Since we already know that $f_1(r) \notin (-\delta A, \delta A)$ and $t$ is at least 1, we get that

$f_1(r) \ge \delta A$.

Also for (5.15), we can rewrite it as

$f_1(r) < \delta A/t - (t^3 f_2(r) - \delta t^2 f_2'(r) + \delta^2 t B + \theta/t^3)$.

Let us further simplify the notation by writing $C$ for $f_2(r)$, $D$ for $\delta f_2'(r)$ and $E$ for $\delta^2 B$. Then we may rewrite the above constraints as follows:

$f_1(r) \ge -\delta A/t + |t^3 C + t^2 D + tE + \theta/t^3|$

and

(5.17)  $\delta A \le f_1(r) < \delta A/t - (t^3 C - t^2 D + tE + \theta/t^3)$.

Notice that the above (upper and lower) bounds hold for any $t$ in $T_r$. Therefore, we know that for any $t_1, t_2 \in T_r$,

$\delta A/t_1 - (t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) \ge -\delta A/t_2 + |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|$,

which is equivalent to

(5.18)  $-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) + \delta A\left(\frac{1}{t_1} + \frac{1}{t_2}\right) \ge |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|$.

Using the fact that $f_1(r) \ge \delta A$, the inequality (5.17) gives $-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) \ge (1 - \frac{1}{t_1})\delta A$, which may be rewritten as $\delta A \le \frac{-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)}{1 - 1/t_1}$. Combining this with (5.18), we know that for any $t_1, t_2 \in T_r$, we have

$-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) \cdot \left(1 + \frac{\frac{1}{t_1} + \frac{1}{t_2}}{1 - \frac{1}{t_1}}\right) \ge |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|$.

By definition, $t_i \ge n$ for any $i$, so we have $\frac{\frac{1}{t_1} + \frac{1}{t_2}}{1 - \frac{1}{t_1}} \le 3/n$. Therefore, for any $t_1, t_2$ in $T_r$, the following inequality holds:

(5.19)  $\frac{-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)}{|t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|} \ge \frac{1}{1 + \frac{1/t_1 + 1/t_2}{1 - 1/t_1}} \ge 1 - 3/n$.



Note that the denominator of the LHS of (5.19) can be zero for at most 6 values of $t_2$; we eliminate any such values from $T_r$, and we still have $|T_r| \ge \Theta(\log n)$. (Actually, we will only need $|T_r| \ge 5$ for the remainder of the argument to establish the required contradiction.) Let us pick $t_0 < t_1 < t_2 < t_3 < t_4$ from $T_r$, and let us write $G$ to denote $-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)$. We know that

$G \le t_1^3|C| + t_1^2|D| + t_1|E| + |\theta|/t_1^3$.

Also for $t_0, t_2, t_3, t_4$, we write:

(5.20)  $F_0 := t_0^3 C - t_0^2 D + t_0 E + \theta/t_0^3$
(5.21)  $F_2 := t_2^3 C - t_2^2 D + t_2 E + \theta/t_2^3$
(5.22)  $F_3 := t_3^3 C - t_3^2 D + t_3 E + \theta/t_3^3$
(5.23)  $F_4 := t_4^3 C - t_4^2 D + t_4 E + \theta/t_4^3$

Let $F$ denote $\max_{i=0,2,3,4} |F_i|$. By (5.19) we know that

(5.24)  $\frac{G}{F} \ge 1 - 3/n$.



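Viewing $C, D, E, \theta$ as unknowns, we may solve the above linear system consisting of equations (5.20), (5.21), (5.22), (5.23) using Cramer's rule. We find that

$C = \dfrac{\begin{vmatrix} F_0 & -t_0^2 & t_0 & 1/t_0^3 \\ F_2 & -t_2^2 & t_2 & 1/t_2^3 \\ F_3 & -t_3^2 & t_3 & 1/t_3^3 \\ F_4 & -t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}}{\begin{vmatrix} t_0^3 & -t_0^2 & t_0 & 1/t_0^3 \\ t_2^3 & -t_2^2 & t_2 & 1/t_2^3 \\ t_3^3 & -t_3^2 & t_3 & 1/t_3^3 \\ t_4^3 & -t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}} = \dfrac{\begin{vmatrix} F_0 & t_0^2 & t_0 & 1/t_0^3 \\ F_2 & t_2^2 & t_2 & 1/t_2^3 \\ F_3 & t_3^2 & t_3 & 1/t_3^3 \\ F_4 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}}{\begin{vmatrix} t_0^3 & t_0^2 & t_0 & 1/t_0^3 \\ t_2^3 & t_2^2 & t_2 & 1/t_2^3 \\ t_3^3 & t_3^2 & t_3 & 1/t_3^3 \\ t_4^3 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}}$.

Since $0 < t_0 < t_2 < t_3 < t_4$ and these values are at least a factor of $n$ apart from each other, we have that

$\begin{vmatrix} t_0^3 & t_0^2 & t_0 & 1/t_0^3 \\ t_2^3 & t_2^2 & t_2 & 1/t_2^3 \\ t_3^3 & t_3^2 & t_3 & 1/t_3^3 \\ t_4^3 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}$

is $\Omega(t_4^3 t_3^2 t_2 t_0^{-3})$.

Since $F = \max_{i=0,2,3,4} |F_i|$, we know that the absolute value of

$\begin{vmatrix} F_0 & t_0^2 & t_0 & 1/t_0^3 \\ F_2 & t_2^2 & t_2 & 1/t_2^3 \\ F_3 & t_3^2 & t_3 & 1/t_3^3 \\ F_4 & t_4^2 & t_4 & 1/t_4^3 \end{vmatrix}$

is at most $O(F t_4^2 t_3 t_0^{-3})$. Thus we have $|C| = O(F/(t_4 t_3 t_2))$.

Similar analysis shows that $|D| = O(F/(t_3 t_2))$; $|E| = O(F/t_2)$; and $|\theta| = O(F t_0^3)$.

Therefore, we have

$|C| t_1^3 + t_1^2|D| + t_1|E| + |\theta|/t_1^3 \le F \cdot O\left(\frac{t_1^3}{t_4 t_3 t_2} + \frac{t_1^2}{t_3 t_2} + \frac{t_1}{t_2} + \frac{t_0^3}{t_1^3}\right)$.

Recalling that $t_{i+1}/t_i \ge n$ as they are different powers of $n$, we have that

$\frac{G}{F} \le O(1/n)$.

This contradicts (5.24) and concludes the proof of the soundness Lemma, Lemma 5.2. □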

5.2 Hardness reduction from Label Cover 

Recall that our reduction is from a Label Cover instance $\mathcal{L}$ specified by $(U, V, E, k, m, \Pi)$. For notational convenience let us write $F(q)$ to denote the space of possible labels for vertex $q \in U \cup V$: for $u \in U$, $F(u)$ denotes $[k]$ and for $v \in V$, $F(v)$ denotes $[m]$.

We reduce to a learning problem with labeled examples in $\mathbb{R}^{|U|k + |V|m} \times \{-1, 1\}$. Let $\dim$ denote $|U|k + |V|m$. For $y \in \mathbb{R}^{\dim}$ and $q \in U \cup V$, we write $y_q$ to denote the vector consisting of all coordinates that correspond to vertex $q$, i.e. $y_u$ denotes $(y_u^{(i)})_{i \in [k]}$ for $u \in U$ and $y_v$ denotes $(y_v^{(i)})_{i \in [m]}$ for $v \in V$.

We give the reduction from Label Cover to the learning problem below. The high level idea is that the Dictator Test $T_2$ is performed on the restricted function $p_v(y)$ for a random $v \in V$.



Reduction from Label Cover $\mathcal{L}$
Input: Label Cover instance $(U, V, E, k, m, \Pi)$.

1. Randomly pick a vertex $v \in V$.

2. For each $w \ne v$, $w \in U \cup V$, set $y_w = 0$.

3. Let $a_1, \ldots, a_m$ be independent $\{0, 1\}$ bits each with $\mathbf{E}[a_i] = \beta$. Let $g_1, \ldots, g_m$ be independent $N(0,1)$ Gaussian random variables. Let $i$ be chosen uniformly from $[(\log m)^2]$ and set $t = m^i$. Let $b$ be a uniform random bit from $\{-1, 1\}$.

4. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_m g_m)$.

5. Let $\omega \in \mathbb{R}^m$ be $\omega = (1, \ldots, 1)$, and set $y_v := t^3 r + b t^2 \delta\omega$.

6. Output the labeled example $(\mathrm{Fold}(y), b)$ (we describe the folding procedure $\mathrm{Fold}(\cdot)$ later).



The learning problem is to find a degree 2 polynomial $p : \mathbb{R}^{\dim} \to \mathbb{R}$ such that $\mathrm{sign}(p(y)) = b$ for the largest possible fraction of labeled examples generated as described above. Let us denote

$p(y) = \theta + \sum_{q \in U \cup V,\, i \in F(q)} c_q^{(i)} y_q^{(i)} + \sum_{q_1, q_2 \in U \cup V,\, i \in F(q_1),\, j \in F(q_2)} c_{q_1,q_2}^{(i,j)} y_{q_1}^{(i)} y_{q_2}^{(j)}$.

Notice that in the reduction, when vertex $v$ is picked we set all the coordinates to zero except $y_v$. Essentially we are performing the test $T_2$ on the function

$p_v = \theta + \sum_{i \in [m]} c_v^{(i)} y_v^{(i)} + \sum_{i,j \in [m]} c_{v,v}^{(i,j)} y_v^{(i)} y_v^{(j)}$,

which is the restriction of $p(y)$ obtained by setting all the coordinates to zero except those coordinates corresponding to vertex $v$. The overall fraction of agreement of $p(y)$ on all examples is the average probability, over all $v \in V$, that $p_v$ passes $T_2$.

Folding Trick: We use the "folding" technique that was first introduced in [SI QI5]. The trick essentially amounts to the following: instead of outputting the labeled example $(y, b)$ in the last step of the reduction, we output $(\mathrm{Fold}(y), b)$ where $\mathrm{Fold}(y)$ is the projection of $y$ into a subspace $H^{\perp}$ (defined below). Folding enables us to enforce that $p$ takes the same value on different points in $\mathbb{R}^{\dim}$ as long as they project to the same point in $H^{\perp}$.

We define the subspaces $H, H^{\perp}$ for our folding as follows:

Definition 5.4. For every $e = \{u, v\} \in E$, $i \in [k]$, we define $b(e, i) \in \mathbb{R}^{\dim}$ to be the vector that has $0$ at every coordinate except that $b(e, i)_u^{(i)}$ is $1$ and for every $j \in (\pi_e)^{-1}(i)$, $b(e, i)_v^{(j)}$ is $-1$. Let $B$ be the collection of all such $b(e, i)$, i.e. $B = \{b(e, i) \mid e = \{u, v\} \in E, i \in [k]\}$. We define $H$ to be $\mathrm{span}(B)$ and $H^{\perp}$ to be the orthogonal complement of $H$ in $\mathbb{R}^{\dim}$.

We define $\mathrm{Fold}(y)$ to be the projection of $y$ onto $H^{\perp}$. It is easy to see that the mapping $\mathrm{Fold}(\cdot)$ can be performed in polynomial time.

After the folding procedure, we can further enforce $p(x)$ to have the property:

For any $h \in H$ and $x \in \mathbb{R}^{\dim}$, $p(x + h) = p(x)$.

We call functions that have the above property "folded". In particular for $e = \{u, v\} \in E$, $c \in \mathbb{R}$, and $i \in [k]$, a folded function $p$ satisfies $p(x + c\,b(e, i)) = p(x)$. If we view $p(y)$ as a polynomial only on $y_u^{(i)}$ and $y_v^{(j)}$ for $j \in (\pi_e)^{-1}(i)$, then Lemma 5.7 shows that we have the following folding property of $p$:

$c_u^{(i)} = \sum_{j \in (\pi_e)^{-1}(i)} c_v^{(j)}$.

If we sum over all possible $i$, this implies for any edge $\{u, v\}$, we have

$\sum_{i \in [k]} c_u^{(i)} = \sum_{i \in [m]} c_v^{(i)}$.

Now we are ready to prove Theorem 1.2. We will show the following two properties of the reduction to complete the proof.

Lemma 5.5 (Completeness). If $\mathrm{Opt}(\mathcal{L}) = 1$, then there is a folded function $p(x)$ that is consistent with a $1 - 1/\log m$ fraction of the labeled examples generated by the reduction.

Lemma 5.6 (Soundness). If $\mathrm{Opt}(\mathcal{L}) \le 1/m^{\eta}$, then there is no folded degree-2 polynomial that is consistent with a $1/2 + \frac{2}{\log m}$ fraction of the labeled examples generated by the reduction.

Combining Lemmas 5.5 and 5.6 and noticing that $m$ can be an arbitrarily large constant (such as $e^{1/\varepsilon}$), we obtain Theorem 1.2. (A discretization similar to that of Section 4.3 is also required, and can be obtained in a routine way by slightly modifying the parameters of that section's construction.)



Proof of Lemma 5.5. Suppose that $\mathrm{Opt}(\mathcal{L}) = 1$, so there is a labeling $\ell$ satisfying all the edges. Then consider the following function

$p(x) = \sum_{w \in U \cup V} x_w^{(\ell(w))}$.

For every $v \in V$, the function $p_v$ is a dictator and passes $T_2$ with probability at least $1 - \frac{1}{\log m}$ by Lemma 5.1. Consequently the overall probability that $p$ passes the test is at least $1 - 1/\log m$. Finally, it is easy to check that this function $p(x)$ is folded. □

Proof of Lemma 5.6. Suppose that there is some folded degree-2 polynomial $p(x)$ such that $\mathrm{sign}(p(x))$ agrees with more than a $\frac{1}{2} + \frac{2}{\log m}$ fraction of the examples, i.e., the average passing probability of $p_v$ on $T_2$ is more than $\frac{1}{2} + \frac{2}{\log m}$. We will show that $\mathrm{Opt}(\mathcal{L}) > 1/m^{\eta}$ and thus prove the lemma.

By an averaging argument, we know that for a $\frac{1}{\log m}$ fraction of the vertices $v \in V$, the restricted polynomial $p_v$ passes the test $T_2$ with probability at least $\frac{1}{2} + \frac{1}{\log m}$; we refer to any such $v$ as a "good" vertex. We say that an edge is "good" if the $V$-endpoint of the edge is a good vertex. Since the graph is regular, we know that at least a $\frac{1}{\log m}$ fraction of all edges are "good".

For a "good" vertex $v$, let us define $I_v$ to be

$I_v = \left\{ j \mid j \in [m],\ c_v^{(j)} \ge \sum_{i=1}^{m} c_v^{(i)}/m^2 \right\}$.

By Lemma 5.2 we have $|I_v| \le (\log m)^2$ and $\sum_{i \in [m]} c_v^{(i)} > 0$. For every $u \in U$, we define $J_u = \left\{ j \mid j \in [k],\ c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)}/k \right\}$. We note that $J_u$ is not empty as

$\max_j c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)}/k$.

We define the following labeling strategy for $\mathcal{L}$. For $u \in U$, randomly assign it a label from $J_u$; for $v \in V$, randomly assign it a label from $I_v$ (if $I_v$ is empty, we assign a random label to $v$).

For every good edge $e = (u, v)$ and any $j \in J_u$, since $p$ is folded, we have that

$\sum_{i \in \pi_e^{-1}(j)} c_v^{(i)} = c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)}/k = \sum_{i \in [m]} c_v^{(i)}/k$.

There is at least one label $i$ in $\pi_e^{-1}(j)$ such that $c_v^{(i)} \ge \sum_{i' \in [m]} c_v^{(i')}/(km) \ge \sum_{i' \in [m]} c_v^{(i')}/m^2$, and this label is therefore in $I_v$. As noted earlier we have $|I_v| \le (\log m)^2$, and so by our randomized labeling strategy there is at least a $1/(\log m)^2$ probability that edge $\{u, v\}$ is satisfied.

Therefore the above labeling strategy satisfies (in expectation) at least a $1/(\log m)^2$ fraction of the good edges and consequently at least a $1/(\log m)^3$ fraction of all edges. This means that $\mathrm{Opt}(\mathcal{L}) > 1/m^{\eta}$ and the proof is complete. □

5.2.1 Folding Lemma 
Lemma 5.7. Let

$$f(x) = \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j$$

be a degree-2 function. Suppose that for every $x \in \mathbb{R}^{n+1}$,
$c \in \mathbb{R}$ we have $f(x + c(1, -1, \ldots, -1)) = f(x)$.
Then $w_0 = \sum_{i=1}^{n} w_i$.

Proof. Expanding the equality $f(x + c(1, -1, \ldots, -1)) = f(x)$,
we get that

$$\theta + w_0(x_0 + c) + \sum_{i=1}^{n} w_i(x_i - c) + w_{00}(x_0 + c)^2 + \sum_{j=1}^{n} w_{0j}(x_0 + c)(x_j - c) + \sum_{1 \le i \le j \le n} w_{ij}(x_i - c)(x_j - c) = \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j.$$

Since this equation holds for all $c, x$, if we express
the LHS and RHS as polynomials in the variables
$c, x_0, x_1, \ldots, x_n$, the corresponding coefficients must
be the same. If we look at the coefficients of the
degree-1 monomial $c$, we have that $w_0 - \sum_{i=1}^{n} w_i = 0$,
and the lemma is proved. □
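
The coefficient extraction in this proof can be checked numerically. The sketch below is illustrative only, and the helper names `make_f` and `linear_coeff_in_c` are hypothetical. It evaluates $f$ along the direction $(1, -1, \ldots, -1)$ from $x = 0$, where $f$ is a quadratic in $c$, and recovers the coefficient of $c$ as $(g(1) - g(-1))/2$. That coefficient should equal $w_0 - \sum_{i=1}^{n} w_i$ for any choice of quadratic terms.

```python
from fractions import Fraction


def make_f(theta, w, W):
    """f(x) = theta + sum_i w[i]*x_i + sum_{i<=j} W[(i, j)]*x_i*x_j."""
    def f(x):
        lin = sum(wi * xi for wi, xi in zip(w, x))
        quad = sum(coef * x[i] * x[j] for (i, j), coef in W.items())
        return theta + lin + quad
    return f


def linear_coeff_in_c(f, n):
    """Coefficient of c in g(c) = f(c * (1, -1, ..., -1)), with x fixed to 0."""
    direction = [1] + [-1] * n

    def g(c):
        return f([c * d for d in direction])

    # g is a quadratic in c, so the odd part (g(1) - g(-1)) / 2 is its linear coefficient.
    return Fraction(g(1) - g(-1), 2)
```

With `w = [3, 1, 4]` the extracted coefficient is $3 - (1 + 4) = -2$, so that $f$ cannot be invariant under the shift; with `w = [5, 1, 4]` it is $0$, as the lemma requires of any invariant $f$.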

6 Conclusion 

We have established two hardness results for proper 
agnostic learning of low-degree PTFs. Our results 
show that even if there exist low-degree PTFs that are 
almost perfect hypotheses, it is computationally hard 
to find low-degree PTF hypotheses that perform even 
slightly better than random guessing; in this sense
our hardness results are rather strong. However, our results
do not rule out the possibility of efficient learning
algorithms when $\epsilon$ is sub-constant, or if unrestricted
hypotheses may be used. Strengthening the hardness 
results along these lines is an important goal for 
future work, but may require significantly new ideas. 

Another natural goal for future work is the 
following technical strengthening of our results: show 
that for any constant d, it is hard to construct a 
degree-d PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction
of a given set of labeled examples, even if there exists
a halfspace that is consistent with a $1 - \epsilon$ fraction of
the data. Such a hardness result would subsume both 
of the results of this paper as well as much prior work, 
and would serve as strong evidence that agnostically 
learning halfspaces under arbitrary distributions is a 
computationally hard problem. 

Appendix 

A Probability inequalities 

We will use the Berry-Esseen Theorem, which is a 
quantitative version of the Central Limit Theorem: 

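Theorem A.1. (Berry-Esseen Theorem) Let
$x_1, x_2, \ldots, x_n$ be i.i.d. uniform $\{-1, 1\}$-valued
random variables. Let $c_1, \ldots, c_n \in \mathbb{R}$ be such that
$\sum_{i=1}^{n} c_i^2 = 1$ and $\max_i |c_i| \le \tau$. Let $g$ denote a unit
Gaussian variable drawn from $N(0, 1)$. Then for any
$\theta \in \mathbb{R}$, we have

$$\left| \Pr\left[ \sum_{i=1}^{n} c_i x_i \le \theta \right] - \Pr[g \le \theta] \right| \le \tau.$$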
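
As a sanity check on the Berry-Esseen bound in its standard form $|\Pr[\sum_i c_i x_i \le \theta] - \Pr[g \le \theta]| \le \tau$, the sketch below computes the exact worst-case gap for the special case of equal weights $c_i = 1/\sqrt{n}$, where $\tau = 1/\sqrt{n}$. The equal-weight restriction and the function names are assumptions made for illustration; the theorem covers arbitrary weights.

```python
import math


def normal_cdf(t):
    """CDF of a standard Gaussian at t."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def max_cdf_gap(n):
    """Largest gap between the CDF of S = (x_1 + ... + x_n)/sqrt(n) and the N(0,1) CDF."""
    scale = math.sqrt(n)
    total = 2 ** n
    gap = 0.0
    below = 0  # number of sign patterns whose sum is strictly smaller than the current atom
    for plus_ones in range(n + 1):
        atom = (2 * plus_ones - n) / scale  # value of S when `plus_ones` coordinates are +1
        count = math.comb(n, plus_ones)
        phi = normal_cdf(atom)
        # The CDF of S is a step function, so the gap is extremal at an atom,
        # either just before the jump or right at it.
        gap = max(gap, abs(below / total - phi), abs((below + count) / total - phi))
        below += count
    return gap
```

For equal weights the gap is driven by the jump at the central atom and stays below $\tau = 1/\sqrt{n}$ for the values of $n$ tried here.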

We will also use the following anti-concentration 
result for low-degree polynomials over Gaussian ran- 
dom variables, due to Carbery and Wright: 

Theorem A.2 ([5]). Let $p : \mathbb{R}^n \to \mathbb{R}$ be a nonzero
degree-$d$ polynomial over the reals. Then for all
$\tau > 0$, we have

$$\Pr_{x \sim N(0,1)^n}\left[ |p(x)| \le \tau \|p\|_2 \right] \le O(d \tau^{1/d}).$$